Abstract

A phylogenetic network N has vertices corresponding to species and arcs corresponding to direct genetic 
inheritance from the species at the tail to the species at the head. Measurements of DNA are often made on 
species in the leaf set, and one seeks to infer properties of the network, possibly including the graph itself. In the 
case of phylogenetic trees, distances between extant species are frequently used to infer the phylogenetic trees by 
methods such as neighbor-joining. 

This paper proposes a tree-average distance for networks more general than trees. The notion requires a weight on 
each arc measuring the genetic change along the arc. For each displayed tree the distance between two leaves is 
the sum of the weights along the path joining them. At a hybrid vertex, each character is inherited from one of its 
parents. We will assume that for each hybrid there is a probability that the inheritance of a character is from a 
specified parent. Assume that the inheritance events at different hybrids are independent. Then for each displayed 
tree there will be a probability that the inheritance of a given character follows the tree; this probability may be 
interpreted as the probability of the tree. The tree-average distance between the leaves is defined to be the 
expected value of their distance in the displayed trees. 

For a class of rooted networks that includes rooted trees, it is shown that the weights and the probabilities at each 
hybrid vertex can be calculated given the network and the tree-average distances between the leaves. Hence 
these weights and probabilities are uniquely determined. The hypotheses on the networks include that hybrid 
vertices have indegree exactly 2 and that vertices that are not leaves have a tree-child.

1 Introduction

In phylogeny, the evolution of a collection of species is 
modelled via a directed graph in which the vertices are 
species and the arcs indicate direct descent, usually with 
modification as mutations accumulate. The leaves typi- 
cally correspond to extant species, while internal vertices 
typically correspond to presumed ancestors. It has been 
common to assume that the directed graphs are trees, 
but more recently more general networks have also 
been studied so as to include the possibility of hybridi- 
zation of species or lateral gene transfer. General frame- 
works for phylogenetic networks are discussed in [1], 
[2], [3], and [4]. See also the recent book [5]. 

There are many methods to reconstruct phylogenetic
trees from information such as the DNA of extant species.
The most generally accepted methods include
maximum parsimony, maximum likelihood, and Bayesian
methods. See [6] for an overview. These methods, however,
are only heuristic, do not guarantee an optimal solution,
and can be very time-consuming for a moderate number
of species.

Suppose X denotes the set of extant species for some 
analysis, including an outgroup which is used to locate 
the root. The DNA information may be summarized via 
the computation of distances between members of X. If
x, y ∈ X, then d(x, y) summarizes the amount of genetic
difference between the DNA strings of x and y. In order 
to compensate at least partially for the possibility of 
repeated mutation at the same site, a number of differ- 
ent distances are in use, based on different models of 
mutation. Notable examples include the Jukes-Cantor 
[7], Kimura [8], HKY [9], and log determinant [10], [11] 
distances. The log determinant distance is especially 
interesting in that it can be proved that typically the
distances add along the paths, so that the distance along
a path is the sum of the distances for each edge along 
the path. 

Some fast methods to reconstruct phylogenetic trees 
make use of distances between members of X. Probably 
the most common distance-based method is Neighbor- 
joining [12]. It is computationally fast. It often gives a 
good initial tree with which heuristic methods begin in 
order to find an improved tree by other methods. 
Another more recent method FastME [13], [14] is based 
on the principle of balanced minimum evolution, in 
which one assumes that the correct tree is the one that 
exhibits the minimal total amount of evolution, suitably 
measured. 

Distance-based methods have been rarely used to con- 
struct phylogenetic networks that are not necessarily 
trees. It is true that distances occur in common explora- 
tory methods to display the diversity of trees for the 
same species such as the split decomposition (see [15] 
or an overview in [5]). These distances, however, are not 
derived from any biologically based model of evolution. 

This paper studies a distance on rooted directed net- 
works that is based upon a model of evolution. Con- 
sider, for example, the network N in Figure 1. The root 
is 1 and there is a hybridization event at 7 with parents 
6 and 8. Vertex 7 is called a hybrid vertex or a reticula- 
tion vertex. For some characters, the character state at 7 
is inherited from the parental species 6, while for other 
characters the character state at 7 is inherited from spe- 
cies 8. For character states inherited from 6 the evolu- 
tionary history is best described by the displayed tree 
N_p, while for character states inherited from 8 the history
is best described by the tree N_{p'}. Here p and p' are
parent maps telling the parent of every non-root vertex.
In the example p(7) = 6 while p'(7) = 8. Each parent
map p leads to a displayed tree N_p.

In Figure 1, each arc might have a numerical weight 
measuring the amount of genetic change on the arc. In 
either tree N_p or N_{p'}, the distance between two vertices
might plausibly be defined as the sum of the weights of
the edges on the unique path between the vertices. This
paper explores the possibility that an appropriate distance
between the vertices in the network N is a
weighted average of the distances in N_p and N_{p'}.

More generally, the trees displayed by a network N
will be conveniently indexed as N_p where p ranges over
all the parent maps. Let Par(N) denote the set of all parent
maps for N. For each hybrid vertex h, the probability
that a character of h is inherited from a particular parent
vertex q_i will be denoted α(q_i, h). Assume that these
inheritances at different hybrid vertices are independent
events. Then for each p ∈ Par(N) we obtain that the
probability Pr(p) that the tree N_p models the inheritance
of a particular character is given by

\[
\Pr(p) = \prod_{h \text{ hybrid}} \alpha(p(h), h).
\]

If x and y are vertices, then the distance between x
and y in N_p, written d(x, y; N_p), is the sum of the
weights of arcs on the unique path joining x and y in
N_p. The tree-average distance d(x, y; N) between x and y
in N will be defined to be the expected value of the distances
in the various trees N_p:

\[
d(x, y; N) = \sum_{p \in \mathrm{Par}(N)} \Pr(p)\, d(x, y; N_p).
\]

If a hybrid vertex h satisfies that each parent q of h 
has the same probability, we will call the inheritance 
equiprobable at h. This special case assumes that the 
contribution from each parent to h is the same; if there 
are two parents, each contributes approximately 50%. 

In Figure 1 note that, for each species in the leafset X 
= {1, 2, 3, 4}, it is plausible that the DNA is available 
since 2, 3, 4 correspond to extant species and 1 to an 
extant outgroup species. Hence it is plausible that we 
know d(x, y; N) for distinct x and y in X, hence (4 choose 2) = 6
nonzero distances. Nevertheless, N has 8 arcs and hence 
it is not likely that from the 6 known distances we 
could compute 8 independent weights for these arcs. 
Indeed, the equations obtained in this paper for this net- 
work have infinitely many solutions. There is a possibility
of simultaneous identical mutations between 6 and
7 and between 8 and 7 which might be confused with
mutations between 7 and 3.

Figure 1 A network N with root 1, and the two trees N_p and N_{p'} that it displays. If N is equiprobable, then 7 inherits approximately half its
characters from 6 and the other characters from 8.

In this paper we will assume that the weight of an arc 
into a hybrid vertex is 0. Thus in Figure 1, the weights 
of arcs (6, 7) and (8, 7) will be zero. Under this assump- 
tion vertex 7 corresponds roughly to the immediate off- 
spring of a hybridization event, in which some 
characters came intact from 6 and the remainder intact 
from 8. Further mutation occurred before species 3 
evolved from 7. 

Note that the number of arcs of N in Figure 1 that are 
not directed into a hybrid vertex is 6. It is therefore 
plausible that given the 6 numbers d(x, y; N) for x, y ∈
{1, 2, 3, 4}, we might be able to recover the weights for
each of the 6 arcs in N that are not directed into the
hybrid vertex 7. These same weights would be utilized
in distances for both N_p and N_{p'}. On the other hand, we
should like to determine an additional parameter α(6, 7)
telling the probability of inheritance by 7 of a character
from 6. It is unlikely that six equations, one for each
d(x, y; N), will uniquely and generically determine seven
real parameters. Indeed, the methods of this paper for
this example lead to six equations in seven unknowns
such that for certain values of the distances the weights
and probabilities are not uniquely determined. Consequently,
for the situation in Figure 1 we will assume that
α(6, 7) = α(8, 7) = 1/2; we call the inheritance equiprobable
at 7.

By contrast, Figure 2 shows another network with X =
{r, x_1, x_2, x_3, y} containing a single hybrid vertex h_0. In
this case there are (5 choose 2) = 10 distances and 8 arcs not
into a hybrid vertex, so it is plausible that the 10 equations
would allow us to uniquely determine a ninth
parameter α_1 = α(q_1, h_0) satisfying 0 < α_1 < 1. In fact,
this paper will show how to determine all 9 parameters.
Then α(q_2, h_0) = 1 - α_1 is also determined. In Figure 2
we will not need to assume equiprobability at h_0.

Figure 2 A minimal configuration needed to be able to find
the probability α_1 = α(q_1, h_0) = 1 - α_2 that a character state in
h_0 is inherited from q_1.

In order to obtain interesting results, assumptions 
must be made about the network N. As an extreme case 
it would be easy to add many more internal vertices and 
edges to the network N of Figure 1 without adding any 
additional leaves yet increasing arbitrarily the number of 
arcs. For example, Figure 3 shows a network in which 
the network N of Figure 1 has been modified by the 
addition of other arcs. The 6 distances do not determine 
the weights for all 7 arcs that do not lead to a hybrid 
vertex in Figure 3. 

Particular kinds of acyclic networks have been studied 
in various papers. Wang et al. [16] and Gusfield et al. 
[17] study "galled trees" in which all recombination 
events are associated with node-disjoint recombination 
cycles; the idea occurs also earlier in [18]. Choy et al. 
[19] and Van Iersel et al. [20] generalized galled trees to
"level-k" networks. Baroni, Semple, and Steel [2] introduced
the idea of a "regular" network, which coincides
with its cover digraph. Cardona et al. [21] discussed
"tree-child" networks, in which every vertex not a leaf
has a child that is not a reticulation vertex. An arc (a, b)
is redundant if there is a directed path from a to b
that does not utilize this arc. The current author has
utilized "normal" networks [22] which are both tree- 
child and contain no redundant arc. 

Most results in this paper assume that the network is 
normal. This means, briefly, that every vertex not in X 
and not a leaf has a tree-child (a child with indegree 
one); and moreover, there is no redundant arc. For 
example, if X = {1, 2, 3, 4} then the network in Figure 1 
is normal while the network in Figure 3 is not normal 
since arc (5, 10) is redundant. With the assumption that
there are no redundant arcs we show in Section 3 that
for a given network N, the tree-average distance d is a
metric on X. With the assumption of normality we also
show that different parent maps p yield different displayed
trees N_p. Hence the average over the parent
maps p is the same as the average over displayed trees.
This result eliminates the logical possibility that different
parent maps p_1 and p_2 might yield displayed trees
that are topologically the same, yielding an uncertainty
about which is the correct average to use in the
definition.

Figure 3 This tree-child network has X = {1, 2, 3, 4}. There are 7
arcs not leading to a hybrid vertex but only 6 distances, and the
weights are not uniquely determined. The network is not normal
because arc (5, 10) is redundant.

The main result, Theorem 4.1, assumes that the net- 
work N is normal and also that for all hybrid vertices 
the indegree is exactly 2 and the outdegree is exactly 1. 
At each hybrid vertex h we assume either equiprobabil- 
ity or else that h has a grandparent on at least one side 
of the reticulation cycle, as in Figure 2 but not Figure 1. 
Then from knowledge both of N and of the tree-average 
distance function d, the weights for all arcs are uniquely 
determined and indeed can be computed by explicit for- 
mulas. Moreover, the probabilities of inheritance at each 
hybrid vertex are uniquely determined and can be com- 
puted by explicit formulas. This calculation is, of course, 
trivial if the network is equiprobable at h. 

A model for a distance function containing certain 
parameters is called identifiable if the parameters can be 
reconstructed from the (exact) values of the distance 
function. Theorem 4.1 thus asserts that, if the tree-aver- 
age distance function d on X and the network N are 
known, then the real parameters of the model (i.e., the 
weights and the probabilities) are identifiable in various 
cases. 

A major problem, of course, is the reconstruction of N 
itself from a distance function d. I have obtained partial 
results (not included in this paper) which give a recon- 
struction of N itself when the distance d is the tree- 
average distance and when the network N satisfies the 
hypotheses of Theorem 4.1 and some additional hypoth- 
eses. The reconstruction of N is possible because of the 
simple forms of the formulas obtained in this paper. 
Essentially, the formulas are simple enough that they 
can be used recursively when only part of the network 
is yet known. I plan a subsequent paper which will uti- 
lize the results in the current paper to reconstruct N 
from the tree-average distances. 

The assumption that all hybrid vertices have indegree 
2, assumed in Theorem 4.1, is plausible biologically 
since in sexually reproducing species an offspring arises 
from one egg and one sperm. 

The assumption that there be no redundant arcs is 
essential for Theorem 4.1. Figure 3 displays a tree-child 
network N with X = {1, 2, 3, 4}. There are 6 indepen- 
dent nonzero distances between the members of X, yet 
there are 7 arcs not directed into hybrid vertices. It is
easy to choose positive values for the tree-average distances
such that there are infinitely many positive
choices of the weights given the network. Note that 
each vertex not a leaf has a tree-child, so the network is 
a tree-child network [21]. Hence Theorem 4.1 cannot be 
extended to general tree-child networks. 

Some other extensions of the current results and problems
are discussed in the concluding Section 6.

2 Fundamental Concepts 

A directed graph or digraph (V, A) consists of a finite
set V of vertices and a finite set A of arcs, each consisting
of an ordered pair (u, v) where u ∈ V, v ∈ V, u ≠ v.
We interpret (u, v) as an arrow from u to v and say
that the arc starts at u and ends at v. There are no multiple
arcs and no loops. If (u, v) ∈ A, say that u is a parent
of v and v is a child of u. A directed path is a
sequence u_0, u_1, ..., u_k of vertices such that for i = 1, ...,
k, (u_{i-1}, u_i) ∈ A. The path is trivial if k = 0. Write u ≤
v if there is a directed path starting at u and ending at
v. The digraph is acyclic if there is no nontrivial directed
path starting and ending at the same point. If the
digraph is acyclic, it is easy to see that ≤ is a partial
order on V.

The indegree of vertex u is the number of v ∈ V such
that (v, u) ∈ A. The outdegree of u is the number of v ∈
V such that (u, v) ∈ A. A leaf is a vertex of outdegree 0.
A normal vertex (or tree vertex) is a vertex of indegree
1. A hybrid vertex (or reticulation vertex) is a vertex of
indegree at least 2. An arc (u, v) is a normal arc if v is a
normal vertex.

A digraph (V, A) is rooted if it has a unique vertex r ∈
V with indegree 0 such that, for all v ∈ V, r ≤ v. This
vertex r is called the root.

Let X denote a finite set. Typically in phylogeny, X is a 
collection of species. Measurements are assumed to be
possible among members of X, so that we may assume,
for example, that the DNA is known for each x ∈ X.

A phylogenetic X-network N = (V, A, r, X) is a rooted
acyclic digraph G = (V, A) with root r such that there is
a one-to-one map φ : X → V whose image contains all
vertices v such that either

(i) v is a leaf; or

(ii) v = r; or

(iii) v has indegree 1 and outdegree 1.

There may be additional vertices in X. We will identify
each x ∈ X with its image φ(x). The set X will be called
the base-set for N.

In biology the network gives a hypothesized relation- 
ship among the members of X. It is quite common also 
that a certain extant outgroup species r is assumed to 
have evolved separately from the rest of the species in 
question. When this happens, we identify the species r 
with the root r. Thus extant species (the leaves) are in X
by (i) since measurements can be made on them. The
outgroup r, which is identified with the root, is in X by 
(ii). If a vertex has indegree 1 and outdegree 1 then 
nothing uniquely determines it unless, for fortuitous 
reasons, it is possible to make measurements on its 
DNA, in which case it lies in the base-set X. 

An X-tree is a phylogenetic X-network such that the 
underlying digraph is a tree. 

Figure 4 shows a phylogenetic X-network N with base- 
set X = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}. The root is r = 1. 
Note that the leaves are in X by (i), 1 ∈ X by (ii), and 10
∈ X by (iii). Measurements such as DNA are assumed
possible on members of X. Since the root 1 is actually 
an outgroup and the leaves are all extant, this is plausi- 
ble for all members of X except 10. We are perhaps 
here assuming that, by some fortuitous chance, some 
historical DNA of 10 is also available. 

An arc (u, v) ∈ A is redundant if there exists w ∈ V
such that u, v, and w are distinct and u ≤ w ≤ v. The
removal of a redundant arc (u, v) still leaves u ≤ v in
the network.
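The definition of a redundant arc can be checked mechanically. The following is a minimal sketch, not part of the original paper: it assumes the network is given as a list of arcs `(u, v)`, and the arc list `FIG1` is reconstructed from the arcs of Figure 1 that are named in the text. The function name `redundant_arcs` and the extra arc `(5, 7)` are hypothetical and used only for illustration.

```python
from collections import defaultdict

# Arcs of the Figure 1 network, reconstructed from the arcs named in the text.
FIG1 = [(1, 5), (5, 6), (5, 8), (6, 2), (6, 7), (8, 7), (7, 3), (8, 4)]

def redundant_arcs(arcs):
    """Return the arcs (u, v) for which some directed path from u to v avoids the arc itself.

    In an acyclic digraph without multiple arcs such a path must pass through
    a third vertex w, which matches the condition u <= w <= v with w distinct.
    """
    children = defaultdict(set)
    for u, v in arcs:
        children[u].add(v)

    def reachable(start, goal, skipped_arc):
        # Depth-first search that is not allowed to traverse skipped_arc.
        stack, seen = [start], {start}
        while stack:
            a = stack.pop()
            for b in children[a]:
                if (a, b) == skipped_arc or b in seen:
                    continue
                if b == goal:
                    return True
                seen.add(b)
                stack.append(b)
        return False

    return [(u, v) for (u, v) in arcs if reachable(u, v, (u, v))]

print(redundant_arcs(FIG1))             # []
print(redundant_arcs(FIG1 + [(5, 7)]))  # [(5, 7)]
```

Adding the hypothetical arc (5, 7) creates a redundant arc because the path 5, 6, 7 already joins its endpoints.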

A phylogenetic X-network N = (V, A, r, X) with
base-set X is normal provided (1) whenever v ∈ V and
v ∉ X, then v has a tree-child c; and (2) there are no
redundant arcs. The networks in Figures 2 and 4 are
normal, while the network of Figure 3 is not normal.
The usage here of "normal" differs slightly from that in
[22] in that here hybrid vertices that are not leaves
may have outdegree 1, whereas in [22] hybrid vertices
that were not leaves had outdegree 2 or higher. There
is an obvious one-to-one relationship between normal
networks in the current sense and normal networks in
the previous sense.

Figure 4 A normal phylogenetic X-network with X = {1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11} and root 1. The root corresponds to an
outgroup species. Measurements on DNA are assumed possible on
members of X.

A normal network N is semibinary if each hybrid node 
has indegree 2 and outdegree 1. It follows from normal- 
ity that the child of the hybrid node is necessarily 
normal. 

A normal path in N from v to x is a directed path v =
v_0, v_1, ..., v_k = x such that for i = 1, ..., k, v_i is normal. A
normal path from v to X is a normal path starting at v
and ending at some x ∈ X. For example, in Figure 4, the
path 20, 18, 19, 8 is normal and is a normal path from
20 to X. The path 18, 17, 16, 5 is not normal since 16 is
hybrid. The trivial path 3 is normal.

Suppose N is normal and v ∈ V. Then there is a normal
path from v to X. To see this, if v ∈ X, then the trivial
path is a normal path from v to X. If v ∉ X, then v = v_0
has a tree-child v_1. If v_1 ∈ X, then the path v_0, v_1 is a
normal path to v_1 in X. Otherwise v_1 has a tree-child v_2.
If v_2 ∈ X then the path v_0, v_1, v_2 is a normal path from
v to v_2 in X. Proceeding in this manner, we must eventually
reach a member of X, since the network is finite and
acyclic and every leaf lies in X; this gives the result.
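The argument above is constructive and can be sketched in a few lines. This is an illustration only, under the assumption that the network is stored as a list of arcs; the helper name `normal_path_to_X` is hypothetical, and the arc list is the Figure 1 network as reconstructed from the arcs named in the text.

```python
# Arcs of the Figure 1 network, reconstructed from the arcs named in the text.
FIG1 = [(1, 5), (5, 6), (5, 8), (6, 2), (6, 7), (8, 7), (7, 3), (8, 4)]
X = {1, 2, 3, 4}

def normal_path_to_X(v, arcs, base_set):
    """Follow tree-children (children of indegree 1) from v until a member of X is reached."""
    indegree, children = {}, {}
    for a, b in arcs:
        indegree[b] = indegree.get(b, 0) + 1
        children.setdefault(a, []).append(b)

    path = [v]
    while path[-1] not in base_set:
        tree_children = [c for c in children.get(path[-1], []) if indegree[c] == 1]
        if not tree_children:
            raise ValueError("vertex %r has no tree-child, so the network is not normal" % (path[-1],))
        # Any tree-child works; min() only makes the choice deterministic.
        path.append(min(tree_children))
    return path

print(normal_path_to_X(5, FIG1, X))  # [5, 6, 2]
print(normal_path_to_X(7, FIG1, X))  # [7, 3]
```

The loop terminates on any acyclic network because each step moves to a strictly later vertex in the partial order.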

Suppose two normal paths shared a common vertex x,
say the normal paths v = v_0, ..., v_k = x and w = w_0, ..., w_j
= x. If k > 0 and j > 0 then since x is normal with a
unique parent, it follows that v_{k-1} = w_{j-1}. Repeating
the argument we find that either there is an i such that
v = w_i or else there is an i such that w = v_i. This argument,
of frequent use, is called following the normal
paths backwards.

A graph (or, for emphasis, an undirected graph) (V, E)
consists of a finite set V of vertices and a finite set E of
edges, each a subset {v_1, v_2} of V consisting of two distinct
vertices. Thus an edge has no direction, while an
arc has a direction. If N = (V, A, r, X) is a phylogenetic
X-network, there is an associated undirected graph
Und(N) = (V, E) in which every arc in A has its direction
ignored; thus E = {{a, b} : (a, b) ∈ A or (b, a) ∈ A}.

3 The Tree-Average Distance 

If N = (V, A, r, X) is a phylogenetic X-network, then a
parent map p for N consists of a map p : V - {r} → V
such that, for all v ∈ V - {r}, p(v) is a parent of v. Note
that r has no parent. If v is normal, then there is only
one possibility for p(v), while if v is hybrid, there are at
least two possibilities for p(v). In Figure 4, an example
of a parent map p satisfies p(20) = 23, p(16) = 17, and
for all other vertices v besides 1, p(v) is the unique parent
of v.

Write Par(N) for the set of all parent maps for N. In
general if there are k distinct hybrid vertices and they
have indegrees respectively i_1, i_2, ..., i_k, then the number
of distinct parent maps p is |Par(N)| = Π[i_j : j = 1, ...,
k]. If N is a network with k distinct hybrid vertices, each
of indegree 2, then |Par(N)| = 2^k.
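The count of parent maps can be illustrated by enumerating them directly. The sketch below is an assumption-laden illustration rather than the paper's procedure: it represents a parent map as a dictionary from each non-root vertex to its chosen parent, uses the Figure 1 arcs as reconstructed from the text, and the name `parent_maps` is hypothetical.

```python
from itertools import product

# Arcs of the Figure 1 network, reconstructed from the arcs named in the text.
FIG1 = [(1, 5), (5, 6), (5, 8), (6, 2), (6, 7), (8, 7), (7, 3), (8, 4)]

def parent_maps(arcs):
    """Yield every parent map as a dict {vertex: chosen parent}; the root has no entry."""
    parents = {}
    for u, v in arcs:
        parents.setdefault(v, []).append(u)
    vertices = sorted(parents)  # assumes vertex labels are mutually comparable
    # One independent choice of parent per non-root vertex; normal vertices have one option.
    for choice in product(*(parents[v] for v in vertices)):
        yield dict(zip(vertices, choice))

maps = list(parent_maps(FIG1))
print(len(maps))                     # 2
print(sorted(m[7] for m in maps))    # [6, 8]
```

Figure 1 has a single hybrid vertex of indegree 2, so the enumeration yields 2 = 2^1 parent maps, differing only in the parent chosen for vertex 7.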

Given p ∈ Par(N), the set A_p of p-arcs is A_p = {(p(v),
v) : v ∈ V - {r}}. The induced tree N_p is the directed
graph (V, A_p) with root r. Note that each vertex in V -
{r} has a unique parent in N_p. Thus N_p is a tree with
vertex set V. The set X, however, need not be a base-set
of N_p. For example, if h is hybrid in N, then in N_p
the vertex h has indegree 1 from the arc (p(h), h) and
outdegree 1, yet need not lie in X.

Several of the proofs will require the notion of "complementary
parents". Suppose p ∈ Par(N) and h is a
particular hybrid vertex with exactly two parents q_1 and
q_2. Assume p(h) = q_1. The complementary parent map
p' of p with respect to h is defined by

\[
p'(v) = \begin{cases} p(v) & \text{if } v \neq h, \\ q_2 & \text{if } v = h. \end{cases}
\]

Thus p' agrees with p except at h, where p' chooses
the other parent from that chosen by p.

A phylogenetic X-network is weighted provided that
for each arc (a, b) ∈ A there is a non-negative number
ω(a, b) called the weight of (a, b) such that

(1) if b is hybrid, then ω(a, b) = 0;

(2) if b is normal, then ω(a, b) ≥ 0.

We call the function ω from the set of arcs to the
reals the weight function of N. We interpret ω(a, b) as
a measure of the amount of genetic change from species
a to species b. If h is hybrid with parents q_1 and q_2 and
unique child c, then the hybridization event is essentially
assumed to be instantaneous between q_1 and q_2 with no
genetic change in those character states inherited by h
from q_1 or q_2 respectively. Further mutation then occurs
from h to c, as measured by ω(h, c).

In any rooted tree T = (V, A, r), two vertices u and v
have a unique most recent common ancestor mrca(u, v)
= mrca(u, v; T) ∈ V that satisfies

(1) mrca(u, v) ≤ u and mrca(u, v) ≤ v;

(2) whenever z ≤ u and z ≤ v, then z ≤ mrca(u, v).

In a network that is not a tree, two vertices u and v
need not have a mrca(u, v).

Suppose that N = (V, A, r, X) is a weighted phylogenetic
X-network with weight function ω. For each p ∈
Par(N) and for each u, v ∈ V, define the distance d(u,
v; N_p) as follows: in N_p there is a unique undirected
path P(u, v) between u and v; define d(u, v; N_p) to be
the sum of the weights of arcs along P(u, v). More precisely,
since N_p is a tree, there exists a most recent common
ancestor m = mrca(u, v; N_p), a directed path P_1
given by m = u_0, u_1, ..., u_k = u from m to u, and a
directed path P_2 given by m = v_0, v_1, ..., v_j = v from m
to v. Define

\[
d(u, v; N_p) = \sum_{i=0}^{k-1} \omega(u_i, u_{i+1}) + \sum_{i=0}^{j-1} \omega(v_i, v_{i+1}).
\]

We shall refer to d(u, v; N_p) as the distance between u
and v in N_p.

Let H denote the set of hybrid vertices of N. For each
h ∈ H, let P(h) denote the set of parents of h, i.e. the
set of vertices u such that (u, h) ∈ A. Since h ∈ H,
|P(h)| ≥ 2. For each u ∈ P(h), let α(u, h) denote the fraction
of the genome that h inherits from u. We may
interpret α(u, h) as the probability that a character is
inherited by h from u, so for all h ∈ H, Σ[α(u, h) : u ∈
P(h)] = 1.

If h and h' are distinct members of H, we will assume
that the inheritances at h and h' are independent. More
generally, suppose for every h ∈ H that q_h is a parent of
h. Then we assume that the events that a character at h
is inherited from q_h are independent. It is then easy to
see that for each p ∈ Par(N) the probability that inheritance
follows the parent map p is Pr(p) = Π[α(p(h), h) :
h ∈ H].

The tree-average distance d(u, v; N) between u and v
in N is defined by

\[
d(u, v; N) = \sum_{p \in \mathrm{Par}(N)} \Pr(p)\, d(u, v; N_p).
\]

It is thus the expected value of the distances between
u and v in the various N_p.

The simplest situation has each parent of h equally
likely, so α(p(h), h) = 1/|P(h)| for each p ∈ Par(N). If
this situation occurs, we call the network equiprobable
at h. If the network N is equiprobable at h for all h ∈
H, then we call the network equiprobable, and for each
u and v in X, d(u, v; N) is the average of the values d(u,
v; N_p) for p ∈ Par(N).

For example, for the network N in Figure 1 suppose
that the arcs have weights given by ω(1, 5) = 1 = ω(5, 6)
= ω(7, 3), while ω(5, 8) = ω(8, 4) = 2 and ω(6, 2) = 4.
Since 7 is hybrid, ω(6, 7) = ω(8, 7) = 0. Suppose, as in
Figure 1, the parent map p satisfies p(7) = 6 while the
parent map p' satisfies p'(7) = 8. Then N_p shown in Figure 1
is obtained from N by deleting the arc (8, 7) while
N_{p'} is obtained from N by deleting the arc (6, 7).
Assume α(6, 7) = 1/3 and α(8, 7) = 2/3, so Pr(p) = 1/3,
Pr(p') = 2/3. To compute d(1, 3; N) we find d(1, 3; N_p)
= ω(1, 5) + ω(5, 6) + ω(6, 7) + ω(7, 3) = 1 + 1 + 0 + 1
= 3, and d(1, 3; N_{p'}) = ω(1, 5) + ω(5, 8) + ω(8, 7) + ω(7, 3)
= 1 + 2 + 0 + 1 = 4. Hence d(1, 3; N) = (1/3) d(1, 3; N_p)
+ (2/3) d(1, 3; N_{p'}) = (1/3)(3) + (2/3)(4) = 11/3. For
another example d(1, 2; N_p) = d(1, 2; N_{p'}) = 6 so d(1, 2;
N) = (1/3)(6) + (2/3)(6) = 6.
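The worked example can be reproduced numerically. The following is a minimal sketch, assuming the network is stored as a dictionary from arcs to weights and the inheritance probabilities as a dictionary keyed by arcs into hybrid vertices; the weights and probabilities are the ones given in the example, while the function names (`parent_maps`, `tree_distance`, `tree_average_distance`) are hypothetical. Exact fractions are used so that 11/3 is not rounded.

```python
from fractions import Fraction
from itertools import product

# Weights from the example; arcs into the hybrid vertex 7 have weight 0.
ARCS = {
    (1, 5): 1, (5, 6): 1, (7, 3): 1, (5, 8): 2, (8, 4): 2, (6, 2): 4,
    (6, 7): 0, (8, 7): 0,
}
ALPHA = {(6, 7): Fraction(1, 3), (8, 7): Fraction(2, 3)}
ROOT = 1

def parent_maps(arcs):
    """Yield every parent map as a dict {vertex: chosen parent}."""
    parents = {}
    for u, v in arcs:
        parents.setdefault(v, []).append(u)
    vertices = sorted(parents)
    for choice in product(*(parents[v] for v in vertices)):
        yield dict(zip(vertices, choice))

def path_to_root(v, p):
    """List the vertices from v up to the root in the tree N_p."""
    path = [v]
    while path[-1] != ROOT:
        path.append(p[path[-1]])
    return path

def tree_distance(x, y, p):
    """d(x, y; N_p): sum of the weights on the path between x and y in N_p."""
    up_x, up_y = path_to_root(x, p), path_to_root(y, p)
    # Drop shared ancestors until both lists end at the most recent common ancestor.
    while len(up_x) > 1 and len(up_y) > 1 and up_x[-2] == up_y[-2]:
        up_x.pop()
        up_y.pop()
    return sum(
        ARCS[(path[i + 1], path[i])]
        for path in (up_x, up_y)
        for i in range(len(path) - 1)
    )

def tree_average_distance(x, y):
    """Expected value of d(x, y; N_p) over all parent maps p."""
    total = Fraction(0)
    for p in parent_maps(ARCS):
        prob = Fraction(1)
        for v, parent in p.items():
            prob *= ALPHA.get((parent, v), 1)  # normal vertices contribute a factor of 1
        total += prob * tree_distance(x, y, p)
    return total

print(tree_average_distance(1, 3))  # 11/3
print(tree_average_distance(1, 2))  # 6
```

Enumerating all parent maps takes time exponential in the number of hybrid vertices, so this direct approach is only meant for small illustrative networks.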

Given u and v, the vertices mrca(u, v; N_p) may differ
for different p. This is seen in Figure 1 where mrca(2, 3;
N_p) = 6 while mrca(2, 3; N_{p'}) = 5.

Theorem 3.1. Assume N = (V, A, r, X) is a phylogenetic
X-network that has no redundant arcs. Assume N
has a weight function ω satisfying that ω(a, b) > 0 if b is
normal. Then the tree-average distance on X from N is a
metric on X.

Proof. A metric d on X must satisfy

(1) For all x and y in X, d(x, y) ≥ 0 and d(x, y) = 0 iff x = y.

(2) For all x and y in X, d(x, y) = d(y, x).

(3) For all x, y, z ∈ X, d(x, z) ≤ d(x, y) + d(y, z).

For (2), suppose x, y ∈ X. For all p, d(x, y; N_p) = d(y,
x; N_p), whence d(x, y; N) = d(y, x; N).

For (3) suppose x, y, z ∈ X. For each N_p, d(x, z; N_p) ≤
d(x, y; N_p) + d(y, z; N_p) from the truth of the four-point
condition, see [23], p 147. Multiplying by Pr(p) ≥ 0 and
summing over p, the result follows for distances in N as well.

For (1) it is clear that for each p, d(x, y; N_p) ≥ 0,
whence d(x, y; N) ≥ 0. Moreover, for each p, d(x, x; N_p)
= 0, whence d(x, x; N) = 0.

To finish the proof of (1), suppose d(x, y; N) = 0; we
show x = y. Assume instead x ≠ y. Since the weights are
nonnegative, for every p we have d(x, y; N_p) = 0. Hence
for every p ∈ Par(N), in N_p the unique path between x
and y contains only arcs (a, b) with b hybrid in N.

If x and y are both normal, then for every p the
unique path between x and y in N_p must consist of a
directed path from v = mrca(x, y; N_p) to x and a path
from v to y; hence it contains a normal arc whence d(x,
y; N_p) > 0. Thus we may assume that one vertex, say y,
is hybrid.

In N choose a directed path P = y_0, y_1, ..., y_k = y such
that y_1 is not hybrid but y_2, ..., y_k are hybrid. This is
always possible because there is a directed path from r
to y, say u_0 = r, u_1, u_2, ..., u_k = y. The child u_1 of r cannot
be hybrid, because if it were, then its other parent q
besides r must also have a path to q from r, and this
path combined with the arc (q, u_1) would make the arc
(r, u_1) redundant. Moreover, we may choose this path
so that x does not lie in {y_1, ..., y_k} since whenever y_i is
hybrid there are at least two choices of the parent y_{i-1},
and we may select y_{i-1} to be distinct from x.

If x is normal in N, let Q be the trivial path z_0 = x.
Otherwise we may choose a directed path Q = z_0, z_1, ...,
z_s = x such that z_0 is not hybrid but all other vertices
are hybrid. Moreover, we may assume that the vertices
of Q are all distinct from the vertices of P. This is
because, if z_i is hybrid, it cannot have two parents q_1
and q_2 which are on P since then there must be a directed
path from say q_1 to q_2, whence the arc (q_1, z_i) is
redundant.

Since the vertices on P and Q are distinct, there exists
a parent map p that agrees with all the choices made in
constructing both P and Q. Hence in N_p, P is a path
from y_0 to y, Q is a path from z_0 to x, and the paths are
disjoint. In N_p let v = mrca(y_0, z_0; N_p). Then in N_p the
unique path between x and y consists of P, Q, a path
from v to y_0, and a path from v to z_0. Since y_1 and z_0
are normal, this path includes a normal arc, so d(x, y;
N_p) > 0. It follows that d(x, y; N) > 0, a contradiction. □

Corollary 3.2. Assume N is a normal network with
weight function ω such that ω(a, b) > 0 if b is normal.
Then the tree-average distance on X from N is a metric
on X.
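As a sanity check of the corollary on a small case, the metric axioms can be verified for the Figure 1 example with α(6, 7) = 1/3. This is only an illustrative sketch: the values d(1, 2) = 6 and d(1, 3) = 11/3 come from the worked example, while the other four table entries were worked out by hand from the same weights and are an assumption of this sketch, as are the names `dist` and `is_metric`.

```python
from fractions import Fraction
from itertools import permutations

# Tree-average distances on X = {1, 2, 3, 4} for the Figure 1 example.
# d(1, 2) and d(1, 3) are stated in the text; the remaining entries are
# hand-computed from the same weights (an assumption of this sketch).
D = {
    (1, 2): Fraction(6), (1, 3): Fraction(11, 3), (1, 4): Fraction(5),
    (2, 3): Fraction(7), (2, 4): Fraction(9), (3, 4): Fraction(4),
}

def dist(table, x, y):
    """Symmetric lookup with d(x, x) = 0, so axiom (2) holds by construction."""
    if x == y:
        return Fraction(0)
    return table[(min(x, y), max(x, y))]

def is_metric(points, table):
    """Check positivity for distinct points and the triangle inequality."""
    positive = all(dist(table, x, y) > 0 for x in points for y in points if x != y)
    triangle = all(
        dist(table, x, z) <= dist(table, x, y) + dist(table, y, z)
        for x, y, z in permutations(points, 3)
    )
    return positive and triangle

print(is_metric([1, 2, 3, 4], D))  # True

# A table that is not a metric: inflating d(2, 4) breaks the triangle inequality.
BAD = dict(D)
BAD[(2, 4)] = Fraction(20)
print(is_metric([1, 2, 3, 4], BAD))  # False
```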

The tree-average distance is defined as a weighted 
average in terms of parent maps. Any tree that arises as 
N_p for some parent map p is said to be displayed in N.
There is a logical possibility that several different parent 
maps p could yield essentially the same displayed tree. 
The next theorem gives sufficient conditions so that in 
fact the displayed trees are all distinct. Hence the tree- 
average distance becomes a weighted average over all 
the distinct displayed trees. 

The proof requires the notion of a split. A split of X is
a partition of X into exactly two nonempty subsets; if
these are A and B, we write the split A|B. Two splits
A_1|B_1 and A_2|B_2 are compatible if at least one of the sets
A_1 ∩ A_2, A_1 ∩ B_2, B_1 ∩ A_2, and B_1 ∩ B_2 is the empty set.
Removal of any edge e (but not its endpoints) from a
tree T produces a split Σ(e) consisting of vertices in the
connected components of T with e removed. The set of
splits of a tree T will be denoted Σ(T). If T is directed,
then the splits of T are obtained by reference only to
the undirected tree, so Σ(T) = Σ(Und(T)). By the
Splits-Equivalence Theorem (see [23], p. 44) any two splits of
a tree are compatible.
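
The compatibility condition translates directly into a set computation. The sketch below is illustrative only; the function name `compatible` and the example label sets are assumptions, not notation from the paper.

```python
def compatible(split1, split2):
    """Return True if at least one of the four pairwise intersections is empty.

    Each split is a pair (A, B) of collections that partition the label set.
    """
    a1, b1 = (frozenset(s) for s in split1)
    a2, b2 = (frozenset(s) for s in split2)
    return any(not (s & t) for s in (a1, b1) for t in (a2, b2))


# Hypothetical label set {r, x1, x2, y}.
nested = compatible(({"r", "x1"}, {"x2", "y"}), ({"r"}, {"x1", "x2", "y"}))
crossing = compatible(({"r", "x2"}, {"y", "x1"}), ({"r", "x1"}, {"y", "x2"}))
```

Here `nested` is True because {x2, y} and {r} are disjoint, while `crossing` is False because all four intersections are nonempty; the second pair has the same shape as the two splits compared in the proof of Theorem 3.3.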

Theorem 3.3. Assume N = (V, A, r, X) is a normal
phylogenetic X-network. Suppose that every hybrid vertex
that is not a leaf has outdegree 1 and that its unique
child is normal. Suppose p and q are distinct parent maps
for N. Then N_p and N_q are topologically distinct trees.

Proof. We show that Σ(N_p) and Σ(N_q) are distinct.
Since p ≠ q there exists a hybrid vertex h such that
p(h) ≠ q(h). Let q_1 = p(h) and q_2 = q(h). Choose a normal
path in N from q_1 to x_1 ∈ X, a normal path from q_2 to
x_2 ∈ X, and a normal path from h to y ∈ X. Note that
each normal path is a path in both N_p and N_q. Moreover,
q_1 is normal in N because otherwise its unique
child would not be a tree-child. Similarly q_2 is normal in
N.

If Σ(N_p) = Σ(N_q), then each pair of splits would be
compatible. In N_p consider the split Σ(a, q_1) where a is
the unique parent of q_1 and we remove the arc (a, q_1)
from N_p. We may write Σ(a, q_1) as A_1|B_1 where A_1
contains r. The directed path in N_p from r to y includes the
arc (a, q_1), so B_1 contains y. Then x_1 ∈ B_1 because there
is a path from q_1 to x_1 and from h to y, neither of which
includes (a, q_1). Moreover, x_2 ∈ A_1. To see this, since
N_p is rooted, there is a directed path from r to q_2. If it
included the arc (a, q_1), then there would be a directed
path in N_p from q_1 to q_2; this is not possible since in
that case the arc (q_1, h) would be redundant in N,
contradicting normality of N. Since N_p contains the directed
path from q_2 to x_2 missing the arc (a, q_1), it follows that
x_2 ∈ A_1. Hence {r, x_2} ⊆ A_1 and {y, x_1} ⊆ B_1.

In N_q consider the split Σ(b, q_2) where b is the parent
of q_2 and we remove the arc (b, q_2) from N_q. Similarly
to the case of N_p we may write Σ(b, q_2) = A_2|B_2 where
{r, x_1} ⊆ A_2 and {y, x_2} ⊆ B_2. If N_p were topologically
the same as N_q, then these splits would need to be
compatible. Yet r ∈ A_1 ∩ A_2, x_2 ∈ A_1 ∩ B_2, x_1 ∈ B_1 ∩ A_2,
and y ∈ B_1 ∩ B_2, contradicting compatibility. □

Corollary 3.4. Suppose N = (V, A, r, X) is a phylogenetic
X-network that is normal. Suppose every hybrid
vertex that is not a leaf has outdegree 1 and its unique
child is normal. Suppose that there are exactly k hybrid
vertices h_1, h_2, ..., h_k and that for i = 1, ..., k, hybrid
vertex h_i has indegree d_i. Then the total number of distinct
trees displayed by N and the total number of parent
maps are both Π[d_i : i = 1, ..., k].
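
The count in Corollary 3.4 can be illustrated by enumerating parent maps directly. This is a small sketch under assumptions: the dictionary `HYBRID_PARENTS` describes a hypothetical network with two hybrid vertices, and the function names are illustrative. Whether the resulting trees are distinct is the content of Theorem 3.3 and is not checked here.

```python
from itertools import product
from math import prod


def count_parent_maps(hybrid_parents):
    """Product of the indegrees d_i of the hybrid vertices."""
    return prod(len(parents) for parents in hybrid_parents.values())


def enumerate_parent_maps(hybrid_parents):
    """List every parent map as a dict hybrid -> chosen parent."""
    hybrids = sorted(hybrid_parents)
    return [dict(zip(hybrids, choice))
            for choice in product(*(hybrid_parents[h] for h in hybrids))]


# Hypothetical hybrids: h1 with indegree 2 and h2 with indegree 3.
HYBRID_PARENTS = {"h1": ["a", "b"], "h2": ["c", "d", "e"]}
```

For this assumed input both functions agree on 2 × 3 = 6 parent maps.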

4 Finding the weight function from d and N 

In this section we prove the main theorem, that the 
weights are determined by knowledge of N and the tree- 
average distances between members of X. For each 
hybrid vertex h we will assume either equiprobability at 
h or else a more complicated situation resembling Fig- 
ure 2. The assumptions can be different at different 
hybrid vertices. 

Theorem 4.1. Suppose N = (V, A, r, X) is a phylogenetic
X-network which is normal and semibinary. Let ω
be a weight function on A satisfying ω(a, b) = 0 if b is
hybrid and ω(a, b) > 0 if b is normal. Assume that N is
known and that the tree-average distance d(x, y; N) is
known for each x and y in X.

For each hybrid vertex h with parents q_1 and q_2,
assume either

(1) the inheritance is equiprobable at h; or

(2) at least one parent (say q_2) satisfies that there
exists q_3 such that

(a) there is a normal path from q_3 to q_2;

(b) there is a normal path from q_3 to some x_3 in X
which is disjoint from the normal path from q_3 to q_2
except for the vertex q_3;

(c) there is no directed path from q_3 to q_1.

Then the weight function ω is uniquely determined
and can be computed explicitly. Moreover, for each
hybrid h, the probabilities α(q_i, h) for each parent q_i of h
are uniquely determined and can be computed explicitly.

See Figure 2 to understand the assumptions about h 
in (2). Throughout this section we will assume the 
hypotheses of Theorem 4.1. 



The proof primarily consists of a number of cases to 
handle different situations. We will present several of 
these special situations as lemmas and then later relate 
these together. Each lemma tells how certain distances 
or weights relate to distances between members of X. 

Lemma 4.2. Assume the hypotheses of Theorem 4.1.
Suppose there is a normal path from a to b. Suppose
there is a normal path from a to x ∈ X which meets the
normal path from a to b only in a. Suppose b has normal
paths to y and z in X which are disjoint except at b.
Then d(a, b; N) = [d(r, y; N) + d(x, z; N) - d(r, x; N) -
d(y, z; N)]/2.

Proof. For each p ∈ Par(N), the path from a to b, the
path from a to x, the path from b to y, and the path
from b to z must lie in N_p since none of the arcs enters
a hybrid vertex. Moreover, there must be a path from r
to a which includes none of the arcs on the other paths
mentioned above. See Figure 5a. Hence for each p ∈
Par(N) one can verify

d(r, y; N_p) = d(r, a; N_p) + d(a, b; N_p) + d(b, y; N_p)
d(x, z; N_p) = d(a, x; N_p) + d(a, b; N_p) + d(b, z; N_p)
d(r, x; N_p) = d(r, a; N_p) + d(a, x; N_p)
d(y, z; N_p) = d(b, y; N_p) + d(b, z; N_p).

It follows that

[d(r, y; N_p) + d(x, z; N_p) - d(r, x; N_p) - d(y, z; N_p)]/2 = d(a, b; N_p).

Taking expected values we see d(a, b; N) = Σ[Pr(p)d(a,
b; N_p): p ∈ Par(N)] = Σ[Pr(p)[d(r, y; N_p) + d(x, z; N_p) -
d(r, x; N_p) - d(y, z; N_p)]/2: p ∈ Par(N)] = [d(r, y; N) +
d(x, z; N) - d(r, x; N) - d(y, z; N)]/2. □
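
As a quick numerical illustration of the identity in Lemma 4.2, the sketch below evaluates it on a single hypothetical tree (so that tree-average distances coincide with ordinary path lengths). The arc weights and the function name `internal_distance` are assumptions chosen for the example.

```python
def internal_distance(d_ry, d_xz, d_rx, d_yz):
    """Lemma 4.2: d(a, b) = [d(r, y) + d(x, z) - d(r, x) - d(y, z)] / 2."""
    return (d_ry + d_xz - d_rx - d_yz) / 2


# Hypothetical tree: r -> a -> b, with a -> x, b -> y, b -> z.
W_RA, W_AX, W_AB, W_BY, W_BZ = 1.0, 2.0, 3.0, 4.0, 5.0

d_ry = W_RA + W_AB + W_BY   # path r, a, b, y
d_xz = W_AX + W_AB + W_BZ   # path x, a, b, z
d_rx = W_RA + W_AX          # path r, a, x
d_yz = W_BY + W_BZ          # path y, b, z

recovered = internal_distance(d_ry, d_xz, d_rx, d_yz)
```

The recovered value equals `W_AB`, the weight of the arc from a to b. In a network the same function would be applied to the tree-average distances, which is justified by the linearity step at the end of the proof.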

Lemma 4.3. Assume the hypotheses of Theorem 4.1. 

(1) Suppose (a, b) is an arc where a ∈ X and b is normal.
Suppose b has normal paths to y and z in X which
are disjoint except at b. Then ω(a, b) = [d(a, y; N) +
d(a, z; N) - d(y, z; N)]/2.






Figure 5 The situation of Lemma 4.2 (a), the situation of
Lemmas 4.4 and 4.7 (b). If p is a parent map with p(a) = q_1, the
figure shows part of N_p together with the arc (q_2, a).

(2) Suppose there is a normal path from a to b ∈ X.
Suppose there is a normal path from a to x ∈ X which
intersects the path from a to b only in a. Then d(a, b; N)
= [d(b, r; N) + d(b, x; N) - d(r, x; N)]/2.

In particular, suppose (a, b) is an arc, b ∈ X is normal,
and there is a normal path from a to x ∈ X which
does not include b. Then ω(a, b) = [d(b, r; N) + d(b, x;
N) - d(r, x; N)]/2.

(3) Suppose (a, b) is an arc and b is normal. Suppose
there is a normal path from a to x ∈ X which does not
include the vertex b. Suppose b has normal paths to y
and z in X which are disjoint except at b. Then ω(a, b)
= [d(r, y; N) + d(x, z; N) - d(r, x; N) - d(y, z; N)]/2.

Proof. For (1) we take a = x in Lemma 4.2 and note
that d(r, y; N) - d(r, a; N) = d(a, y; N). For (2) we take b
= y = z in Lemma 4.2 and note that d(y, z; N) = 0. For
(3), we use the normal path a, b as the path from a to
b. □

Lemma 4.4. Assume the hypotheses of Theorem 4.1.
Suppose there is a normal path from a to y ∈ X where a
is hybrid with indegree 2 and parents q_1 and q_2. Assume
q_1 and q_2 have normal paths to x_1 and x_2 respectively in
X. Then d(a, y; N) = [d(y, x_1; N) + d(y, x_2; N) - d(x_1, x_2;
N)]/2.

Proof. See Figure 5b. We first show that the portion of
the figure including the paths from q_1 to x_1, from q_2 to
x_2, from a to y and the arcs (q_1, a) and (q_2, a) accurately
represents the hypotheses of the lemma. (The network
in Figure 3, which is not normal, has this situation with
a = 10, q_1 = 5, q_2 = 6, x_1 = x_2 = 4, y = 2. Hence Figure
5b is wrong for the network in Figure 3, primarily
because the normal paths from q_1 to x_1 and from q_2 to
x_2 intersect.) I claim that for normal networks the normal
paths from q_1 to x_1 and from q_2 to x_2 have no vertex
in common. To see this, suppose there were such a
common vertex w. In that case by following the normal
paths backwards from w we infer that either q_1 lies on
the path from q_2 to x_2 or else q_2 lies on the path from
q_1 to x_1. In the former case there is a directed path
from q_2 to q_1, whence the arc (q_2, a) is redundant,
contradicting the normality of the network. In the latter
case (q_1, a) is redundant. It follows that the paths are
disjoint. In particular, x_1 ≠ x_2.

Similarly, neither path can intersect the normal path
from a to y. If, for example, the path from q_1 to x_1
intersected the path from a to y, then by following the
normal paths backwards we would have that either q_1
lies on the path from a to y or else a lies on the path
from q_1 to x_1. In the former case there would be a
directed cycle from q_1 to a to q_1, contradicting that the
network is acyclic. In the latter case the hybrid vertex a
would lie on the normal path from q_1 to x_1, contradicting
that it is a normal path.



Suppose p ∈ Par(N) is a parent map that satisfies p(a)
= q_1, and let p' denote the complementary parent map
that agrees with p except that p'(a) = q_2. Thus N_p and
N_{p'} agree except that N_p contains the arc (q_1, a) while
N_{p'} contains instead the arc (q_2, a). In particular they
both contain the same paths from q_1 to x_1, from q_2 to
x_2, and from a to y. Let v = mrca(q_1, q_2; N_p). There is a
directed path from r to v since r is the root (possibly r
= v). There are directed paths from v to q_1 and v to q_2
in N_p which are disjoint except for v. Figure 5b thus
shows a portion of N_p relevant to the lemma, together
with the arc (q_2, a).

In N_p we see from Figure 5b that

d(y, x_1; N_p) = d(a, y; N_p) + ω(q_1, a) + d(q_1, x_1; N_p),

d(y, x_2; N_p) = d(a, y; N_p) + ω(q_1, a) + d(q_1, q_2; N_p) + d(q_2, x_2; N_p),

d(x_1, x_2; N_p) = d(q_1, x_1; N_p) + d(q_1, q_2; N_p) + d(q_2, x_2; N_p).

By substituting these formulas we see that [d(y, x_1; N_p)
+ d(y, x_2; N_p) - d(x_1, x_2; N_p)]/2 = d(a, y; N_p) + ω(q_1, a).
Since ω(q_1, a) = 0 because a is hybrid, it follows

[d(y, x_1; N_p) + d(y, x_2; N_p) - d(x_1, x_2; N_p)]/2 = d(a, y; N_p).

The network N_{p'} is the same except that (q_1, a) is
replaced by (q_2, a). A symmetric argument then shows

[d(y, x_1; N_{p'}) + d(y, x_2; N_{p'}) - d(x_1, x_2; N_{p'})]/2
= d(a, y; N_{p'}) + ω(q_2, a) = d(a, y; N_{p'}).

Since the indegree of a is 2, every parent map p satisfies
either p(a) = q_1 or p(a) = q_2. It follows that for
every p ∈ Par(N), [d(y, x_1; N_p) + d(y, x_2; N_p) - d(x_1, x_2;
N_p)]/2 = d(a, y; N_p).

When we take the expected value over all p ∈ Par(N)
we obtain by linearity [d(y, x_1; N) + d(y, x_2; N) - d(x_1,
x_2; N)]/2 = d(a, y; N). □
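
The identity of Lemma 4.4 can be checked numerically on a small example. The sketch below assumes a hypothetical network shaped like Figure 5b (root r, vertex v with children q_1 and q_2, leaves x_1 below q_1 and x_2 below q_2, hybrid a with child y); all weights and names are illustrative assumptions. It shows the formula recovering d(a, y) for any inheritance probability.

```python
def hybrid_to_leaf_distance(d_y_x1, d_y_x2, d_x1_x2):
    """Lemma 4.4: d(a, y) = [d(y, x1) + d(y, x2) - d(x1, x2)] / 2."""
    return (d_y_x1 + d_y_x2 - d_x1_x2) / 2


# Hypothetical weights: v->q1, v->q2, q1->x1, q2->x2, a->y.
A, B, P1, P2, H = 2.0, 3.0, 4.0, 5.0, 6.0


def averaged_distances(alpha):
    """Tree-average distances when arc (q1, a) is chosen with probability alpha."""
    # With (q1, a): y reaches x1 directly through q1, and x2 through v.
    # With (q2, a): y reaches x2 directly through q2, and x1 through v.
    d_y_x1 = alpha * (H + P1) + (1 - alpha) * (H + B + A + P1)
    d_y_x2 = alpha * (H + A + B + P2) + (1 - alpha) * (H + P2)
    d_x1_x2 = P1 + A + B + P2  # identical in both displayed trees
    return d_y_x1, d_y_x2, d_x1_x2


def recovered_distance(alpha):
    return hybrid_to_leaf_distance(*averaged_distances(alpha))
```

For every probability between 0 and 1 the recovered value equals `H`, the assumed length of the path from a to y, which reflects that Lemma 4.4 does not need equiprobability.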

Lemma 4.5. Assume the hypotheses of Theorem 4.1.
Suppose (a, b) is an arc such that b is normal, and a is
hybrid with indegree 2 and parents q_1 and q_2. Assume
q_1 and q_2 have normal paths to x_1 and x_2 respectively in
X. Suppose b has normal paths to w and z in X where
the paths are disjoint except for b. Then

ω(a, b) = [d(x_1, w; N) + d(x_2, z; N) - d(x_1, x_2; N) - d(w, z; N)]/2.

Proof. Since b is normal and the paths from b to w
and from b to z are normal and disjoint except for b, we
have d(w, z; N_p) = d(b, w; N_p) + d(b, z; N_p) for every
parent map p, whence d(w, z; N) = d(b, w; N) + d(b, z;
N). Similarly d(a, w; N) = ω(a, b) + d(b, w; N) and d(a,
z; N) = ω(a, b) + d(b, z; N).
Hence [d(a, w; N) + d(a, z; N) - d(w, z; N)]/2
= [ω(a, b) + d(b, w; N) + ω(a, b) + d(b, z; N) - d(b, w;
N) - d(b, z; N)]/2 = ω(a, b).

In addition, Lemma 4.4 applies with y replaced by w
since the path from a to b to w is normal. Hence d(a,
w; N) = [d(w, x_1; N) + d(w, x_2; N) - d(x_1, x_2; N)]/2.

Lemma 4.4 also applies with y replaced by z. Hence
d(a, z; N) = [d(z, x_1; N) + d(z, x_2; N) - d(x_1, x_2; N)]/2.

By substitution it follows ω(a, b) = [d(a, w; N) + d(a,
z; N) - d(w, z; N)]/2

= [d(w, x_1; N) + d(w, x_2; N) - 2d(x_1, x_2; N) + d(z, x_1; N) +
d(z, x_2; N) - 2d(w, z; N)]/4.

But symmetry shows that for each parent map p, d(w,
x_2; N_p) + d(z, x_1; N_p) = d(w, x_1; N_p) + d(z, x_2; N_p).
Hence by taking the expected value over p ∈ Par(N), we
have d(w, x_2; N) + d(z, x_1; N) = d(w, x_1; N) + d(z, x_2; N).

Thus ω(a, b) = [2d(w, x_1; N) - 2d(x_1, x_2; N) + 2d(z, x_2;
N) - 2d(w, z; N)]/4 = [d(w, x_1; N) - d(x_1, x_2; N) + d(z, x_2;
N) - d(w, z; N)]/2. □

For the next calculations we require a preliminary
result. Suppose h_0 is hybrid with indegree 2 and parents
q_1 and q_2. For a given parent map p with p(h_0) = q_1, let
p' denote the complementary parent map and G_p = N_p
∪ N_{p'} be the network N_p with the additional arc (q_2, h_0).
Let H be the set of hybrid vertices of N. For each p ∈
Par(N) satisfying p(h_0) = q_1, let W(p) = Π[α(p(h), h): h
∈ H, h ≠ h_0]. Hence Pr(p) = α(q_1, h_0)W(p) and Pr(p') =
α(q_2, h_0)W(p).

Lemma 4.6. For any X-network M which is a subnetwork
of N, suppose C(M) is a linear combination of
expressions of form d(a, b; M). Then

(1) C(G_p) = α(q_1, h_0)C(N_p) + α(q_2, h_0)C(N_{p'}).

(2) C(N) = Σ[W(p)C(G_p): p ∈ Par(N), p(h_0) = q_1].

Proof. For (1), d(x, y; G_p) = α(q_1, h_0)d(x, y; N_p) +
α(q_2, h_0)d(x, y; N_{p'}). For (2) each term d(a, b; N) =
Σ[Pr(p)d(a, b; N_p): p ∈ Par(N)]. Hence
C(N) = Σ[Pr(p)C(N_p): p ∈ Par(N)] by linearity
= Σ[Pr(p)C(N_p) + Pr(p')C(N_{p'}): p(h_0) = q_1]
= Σ[α(q_1, h_0)W(p)C(N_p) + α(q_2, h_0)W(p)C(N_{p'}): p(h_0) = q_1]
= Σ[W(p)[α(q_1, h_0)C(N_p) + α(q_2, h_0)C(N_{p'})]: p(h_0) = q_1]
= Σ[W(p)C(G_p): p ∈ Par(N), p(h_0) = q_1]. □

Lemma 4.7. Assume the hypotheses of Theorem 4.1.
Suppose a is hybrid with indegree 2 and parents q_1 and
q_2. Assume the inheritance is equiprobable at a. Suppose
there is a normal path from q_1 to x_1 ∈ X, from q_2 to x_2
∈ X, and from a to y ∈ X. Then d(q_1, x_1; N) = d(x_1, y;
N) - d(r, y; N) + [d(r, x_1; N) + d(r, x_2; N) - d(x_1, x_2; N)]/2.

Proof. See Figure 5b. As in the proof of Lemma 4.4,
the portion of the figure including the paths from q_1 to
x_1, from q_2 to x_2, from a to y and the arcs (q_1, a) and
(q_2, a) accurately represents the hypotheses of the
lemma since N is normal. Suppose p ∈ Par(N) satisfies
p(a) = q_1. Let p' denote the complementary parent map
such that p'(a) = q_2. Then all three normal paths in the
statement lie in both N_p and N_{p'} since they contain no
hybrid arcs. Note that N_p contains (q_1, a) and not (q_2,
a), while N_{p'} contains (q_2, a) but not (q_1, a). Moreover,
the path in N_p between q_1 and q_2 must be the same as
the path in N_{p'} between q_1 and q_2. Let v = mrca(q_1, q_2;
N_p); then v is also mrca(q_1, q_2; N_{p'}).

For any phylogenetic X-network M with the same
base-set X write L(M) = d(x_1, y; M) - d(r, y; M) + [d(r,
x_1; M) + d(r, x_2; M) - d(x_1, x_2; M)]/2.

Note that L is a linear expression.

In both N_p and N_{p'},

d(r, x_1) = d(r, v) + d(v, q_1) + d(q_1, x_1)

d(r, x_2) = d(r, v) + d(v, q_2) + d(q_2, x_2)

d(x_1, x_2) = d(q_1, x_1) + d(v, q_1) + d(v, q_2) + d(q_2, x_2).

Hence [d(r, x_1) + d(r, x_2) - d(x_1, x_2)]/2 = d(r, v).

In N_p we find d(x_1, y; N_p) = d(x_1, q_1; N_p) + ω(q_1, a) +
d(a, y; N_p), and d(r, y; N_p) = d(r, v; N_p) + d(v, q_1; N_p) +
ω(q_1, a) + d(a, y; N_p).

Hence L(N_p) = d(x_1, y; N_p) - d(r, y; N_p) + [d(r, x_1; N_p)
+ d(r, x_2; N_p) - d(x_1, x_2; N_p)]/2 = d(x_1, y; N_p) - d(r, y;
N_p) + d(r, v; N_p) = d(x_1, q_1; N_p) + ω(q_1, a) + d(a, y; N_p)
- d(r, v; N_p) - d(v, q_1; N_p) - ω(q_1, a) - d(a, y; N_p) + d(r,
v; N_p) = d(x_1, q_1; N_p) - d(v, q_1; N_p).

In N_{p'} we find d(x_1, y; N_{p'}) = d(q_1, x_1; N_{p'}) + d(v, q_1; N_{p'})
+ d(v, q_2; N_{p'}) + ω(q_2, a) + d(a, y; N_{p'}) and d(r, y; N_{p'}) =
d(r, v; N_{p'}) + d(v, q_2; N_{p'}) + ω(q_2, a) + d(a, y; N_{p'}).

Hence L(N_{p'}) = d(x_1, y; N_{p'}) - d(r, y; N_{p'}) + [d(r, x_1;
N_{p'}) + d(r, x_2; N_{p'}) - d(x_1, x_2; N_{p'})]/2 = d(x_1, y; N_{p'}) -
d(r, y; N_{p'}) + d(r, v; N_{p'}) = d(q_1, x_1; N_{p'}) + d(v, q_1; N_{p'}).
Thus L(N_p) + L(N_{p'}) = d(q_1, x_1; N_p) - d(v, q_1; N_p) +
d(q_1, x_1; N_{p'}) + d(v, q_1; N_{p'}) = d(q_1, x_1; N_p) + d(q_1, x_1;
N_{p'}) since d(v, q_1; N_p) = d(v, q_1; N_{p'}).

Using Lemma 4.6(1) with h_0 = a, we see that L(G_p) =
α(q_1, a)L(N_p) + α(q_2, a)L(N_{p'}) so L(G_p) = (1/2)[L(N_p) +
L(N_{p'})] by equiprobability at a.

From above it follows L(G_p) = (1/2)d(q_1, x_1; N_p) +
(1/2)d(q_1, x_1; N_{p'}).

By Lemma 4.6(2) L(N) = Σ[W(p)L(G_p): p ∈ Par(N),
p(a) = q_1]

= Σ[W(p)(1/2)d(q_1, x_1; N_p) + W(p)(1/2)d(q_1, x_1; N_{p'}):
p(a) = q_1]

= Σ[Pr(p)d(q_1, x_1; N_p) + Pr(p')d(q_1, x_1; N_{p'}): p ∈
Par(N), p(a) = q_1]
= Σ[Pr(p)d(q_1, x_1; N_p): p ∈ Par(N)]
= d(q_1, x_1; N). □
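
The role of equiprobability in Lemma 4.7 can be seen numerically. The sketch below again assumes a hypothetical network shaped like Figure 5b with a single hybrid a; the weights and function names are illustrative assumptions. From the proof, the expression L evaluates to d(q_1, x_1) - d(v, q_1) in one displayed tree and d(q_1, x_1) + d(v, q_1) in the other, so only an equal mixture cancels the d(v, q_1) term.

```python
def lemma_4_7(d_x1_y, d_r_y, d_r_x1, d_r_x2, d_x1_x2):
    """L = d(x1, y) - d(r, y) + [d(r, x1) + d(r, x2) - d(x1, x2)] / 2."""
    return d_x1_y - d_r_y + (d_r_x1 + d_r_x2 - d_x1_x2) / 2


# Hypothetical weights: r->v, v->q1, v->q2, q1->x1, q2->x2, a->y.
R, A, B, P1, P2, H = 1.0, 2.0, 3.0, 4.0, 5.0, 6.0


def expression_l(alpha):
    """Evaluate L on tree-average distances when (q1, a) has probability alpha."""
    d_x1_y = alpha * (P1 + H) + (1 - alpha) * (P1 + A + B + H)
    d_r_y = alpha * (R + A + H) + (1 - alpha) * (R + B + H)
    d_r_x1 = R + A + P1
    d_r_x2 = R + B + P2
    d_x1_x2 = P1 + A + B + P2
    return lemma_4_7(d_x1_y, d_r_y, d_r_x1, d_r_x2, d_x1_x2)
```

With these assumed weights, `expression_l(0.5)` returns `P1`, the length of the path from q_1 to x_1, whereas for probability 0.3 the result is `P1 + 0.4 * A`; this is consistent with the lemma requiring equiprobable inheritance at a.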

It is interesting in the proof that different choices of 
the parent map p may yield different vertices v; never- 
theless all these choices cancel out. 

Lemma 4.8. Assume the hypotheses of Theorem 4.1.
Suppose h is hybrid with indegree 2 and parents q_1
and q_2. Assume equiprobable inheritance at h. Suppose
there is a normal path from q_2 to x_2 ∈ X and from h
to y ∈ X. Suppose q_1 has normal child b and there are
normal paths from b to z_1 ∈ X and from b to z_2 ∈ X
such that these paths intersect only at b. Then ω(q_1, b)
= [2d(z_1, y; N) - 4d(r, y; N) + d(r, z_1; N) + 2d(r, x_2; N)
- d(z_1, x_2; N) + 2d(z_2, y; N) + d(r, z_2; N) - d(z_2, x_2; N) -
2d(z_1, z_2; N)]/4.

In particular, if b is a leaf, then ω(q_1, b) = [2d(b, y; N)
- 2d(r, y; N) + d(r, b; N) + d(r, x_2; N) - d(b, x_2; N)]/2.

Proof. By an argument like that for Lemma 4.2, for
each p ∈ Par(N) we have ω(q_1, b) = [d(q_1, z_1; N_p) +
d(q_1, z_2; N_p) - d(z_1, z_2; N_p)]/2,

whence by averaging over p ∈ Par(N) we find ω(q_1, b)
= [d(q_1, z_1; N) + d(q_1, z_2; N) - d(z_1, z_2; N)]/2.

But the paths from q_1 to z_1 and from q_1 to z_2 are normal,
so by Lemma 4.7 d(q_1, z_1; N) = d(z_1, y; N) - d(r, y;
N) + [d(r, z_1; N) + d(r, x_2; N) - d(z_1, x_2; N)]/2 and d(q_1, z_2;
N) = d(z_2, y; N) - d(r, y; N) + [d(r, z_2; N) + d(r, x_2; N) -
d(z_2, x_2; N)]/2. Hence ω(q_1, b) = [d(z_1, y; N) - d(r, y; N) +
[d(r, z_1; N) + d(r, x_2; N) - d(z_1, x_2; N)]/2 + d(z_2, y; N) -
d(r, y; N) + [d(r, z_2; N) + d(r, x_2; N) - d(z_2, x_2; N)]/2 -
d(z_1, z_2; N)]/2 = [2d(z_1, y; N) - 2d(r, y; N) + d(r, z_1; N) +
d(r, x_2; N) - d(z_1, x_2; N) + 2d(z_2, y; N) - 2d(r, y; N) +
d(r, z_2; N) + d(r, x_2; N) - d(z_2, x_2; N) - 2d(z_1, z_2; N)]/4 =
[2d(z_1, y; N) - 4d(r, y; N) + d(r, z_1; N) + 2d(r, x_2; N) -
d(z_1, x_2; N) + 2d(z_2, y; N) + d(r, z_2; N) - d(z_2, x_2; N) -
2d(z_1, z_2; N)]/4.

If b is a leaf we may take b = z_1 = z_2 to obtain ω(q_1, b)
= [2d(b, y; N) - 4d(r, y; N) + d(r, b; N) + 2d(r, x_2; N) -
d(b, x_2; N) + 2d(b, y; N) + d(r, b; N) - d(b, x_2; N) - 2d(b,
b; N)]/4 = [4d(b, y; N) - 4d(r, y; N) + 2d(r, b; N) + 2d(r, x_2;
N) - 2d(b, x_2; N) - 2d(b, b; N)]/4 = [2d(b, y; N) - 2d(r, y; N)
+ d(r, b; N) + d(r, x_2; N) - d(b, x_2; N)]/2. □
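
The closed formula of Lemma 4.8 can be checked on a small example. The sketch below assumes a hypothetical network with root r, vertex v with children q_1 and q_2, a normal child b of q_1 with leaf children z_1 and z_2, a leaf x_2 below q_2, and an equiprobable hybrid h with leaf child y; all weights and names are illustrative assumptions.

```python
def omega_q1_b(d):
    """Lemma 4.8 formula for the weight of arc (q1, b) from leaf distances."""
    return (2 * d["z1", "y"] - 4 * d["r", "y"] + d["r", "z1"]
            + 2 * d["r", "x2"] - d["z1", "x2"] + 2 * d["z2", "y"]
            + d["r", "z2"] - d["z2", "x2"] - 2 * d["z1", "z2"]) / 4


# Hypothetical weights: r->v, v->q1, v->q2, q1->b, b->z1, b->z2, q2->x2, h->y.
RV, VQ1, VQ2, Q1B, BZ1, BZ2, Q2X2, HY = 1.0, 2.0, 3.0, 1.5, 4.0, 2.5, 5.0, 6.0


def tree_average_distances():
    """Tree-average distances with equiprobable inheritance at h."""
    via_q1 = {  # displayed tree containing the arc (q1, h)
        ("z1", "y"): BZ1 + Q1B + HY,
        ("z2", "y"): BZ2 + Q1B + HY,
        ("r", "y"): RV + VQ1 + HY,
    }
    via_q2 = {  # displayed tree containing the arc (q2, h)
        ("z1", "y"): BZ1 + Q1B + VQ1 + VQ2 + HY,
        ("z2", "y"): BZ2 + Q1B + VQ1 + VQ2 + HY,
        ("r", "y"): RV + VQ2 + HY,
    }
    d = {k: 0.5 * via_q1[k] + 0.5 * via_q2[k] for k in via_q1}
    d.update({  # these pairs avoid h, so both trees agree
        ("r", "z1"): RV + VQ1 + Q1B + BZ1,
        ("r", "z2"): RV + VQ1 + Q1B + BZ2,
        ("r", "x2"): RV + VQ2 + Q2X2,
        ("z1", "x2"): BZ1 + Q1B + VQ1 + VQ2 + Q2X2,
        ("z2", "x2"): BZ2 + Q1B + VQ1 + VQ2 + Q2X2,
        ("z1", "z2"): BZ1 + BZ2,
    })
    return d
```

Applying `omega_q1_b` to the distances returned by `tree_average_distances()` gives back the assumed weight `Q1B`.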

We next prove analogues of Lemma 4.7 and Lemma 
4.8 for the case where the hybrid is not equiprobable 
and we are dealing with the situation in Figure 2 rather 
than Figure 5b. 

Lemma 4.9. Assume the hypotheses of Theorem 4.1.
Suppose h_0 is hybrid with indegree 2 and parents q_1 and
q_2. Suppose there is a normal path from q_1 to x_1 ∈ X,
from q_2 to x_2 ∈ X, and from h_0 to y ∈ X. Assume q_3 is
such that there is a normal path from q_3 to q_2, a normal
path from q_3 to x_3 ∈ X, but no directed path from q_3 to
q_1. Suppose M is a phylogenetic X-network that is a
subnetwork of N. Let

(a) w_rv(M) = [d(r, x_1; M) + d(r, x_3; M) - d(x_1, x_3; M)]/2
= [d(r, x_1; M) + d(r, x_2; M) - d(x_1, x_2; M)]/2

(b) w_vq3(M) = [d(r, x_3; M) + d(x_1, x_2; M) - d(r, x_1; M) - d(x_3, x_2; M)]/2

(c) w_q3x3(M) = [d(r, x_3; M) + d(x_3, x_2; M) - d(r, x_2; M)]/2

(d) w_hy(M) = [d(y, x_2; M) + d(y, x_1; M) - d(x_1, x_2; M)]/2

(e) E_2(M) = d(x_1, y; M) - d(r, y; M) + w_rv(M)

(f) E_4(M) = d(x_2, y; M) - d(r, y; M) + w_rv(M)

(g) α(M) = [2d(x_3, y; M) - 2w_q3x3(M) - 2w_hy(M) - d(r, x_1; M) + E_2(M) +
2w_rv(M) + E_4(M) - d(r, x_2; M) + 2w_vq3(M)]/[4w_vq3(M)]

(h) w_vq1(M) = [d(r, x_1; M) - E_2(M) - w_rv(M)]/[2α(M)]

(i) w_q3q2(M) = [d(x_3, y; M) - w_q3x3(M) - w_hy(M) - α(M)(w_vq3(M) + w_vq1(M))]/(1 - α(M))

(j) w_q1x1(M) = d(r, x_1; M) - w_rv(M) - w_vq1(M)

(k) w_q2x2(M) = d(r, x_2; M) - w_rv(M) - w_vq3(M) - w_q3q2(M)

(l) C(M) = 2d(x_3, y; M) - 2w_q3x3(M) - 2w_hy(M) - d(r, x_1; M) + E_2(M) +
2w_rv(M) + E_4(M) - d(r, x_2; M) + 2w_vq3(M)

(m) D(M) = 4w_vq3(M).

Then

(i) α(q_1, h_0; N) = α(N) = C(N)/D(N).

(ii) d(q_1, x_1; N) = w_q1x1(N).

(iii) d(q_2, x_2; N) = w_q2x2(N).

Proof. Suppose p ∈ Par(N) is a parent map satisfying
p(h_0) = q_1 and p' is the complementary parent map
agreeing with p except that p'(h_0) = q_2. Let G_p = N_p
with the additional arc (q_2, h_0), so G_p = N_p ∪ N_{p'}. A
portion of G_p is shown in Figure 2. Note that Figure 2 is
accurate for every p (although the vertex v may differ
for different p) because of the hypotheses on q_1, q_2, q_3,
h_0, x_1, x_2, x_3, and y.

Write u_rv = d(r, v; G_p), u_vq1 = d(v, q_1; G_p),
u_vq3 = d(v, q_3; G_p), u_q3x3 = d(q_3, x_3; G_p),
u_q3q2 = d(q_3, q_2; G_p), u_q2x2 = d(q_2, x_2; G_p),
u_hy = d(h_0, y; G_p), u_q1x1 = d(q_1, x_1; G_p).

The definition of the tree-average distance yields the
following ten equations for G_p, where α = α(q_1, h_0).

d(r, x_1; G_p) = u_rv + u_vq1 + u_q1x1

d(r, x_3; G_p) = u_rv + u_vq3 + u_q3x3

d(r, x_2; G_p) = u_rv + u_vq3 + u_q3q2 + u_q2x2

d(r, y; G_p) = α[u_rv + u_vq1 + u_hy] + (1 - α)[u_rv + u_vq3 + u_q3q2 + u_hy]
= u_rv + u_hy + α u_vq1 + (1 - α)(u_vq3 + u_q3q2)

d(x_1, x_3; G_p) = u_q1x1 + u_vq1 + u_vq3 + u_q3x3

d(x_1, x_2; G_p) = u_q1x1 + u_vq1 + u_vq3 + u_q3q2 + u_q2x2

d(x_1, y; G_p) = α[u_q1x1 + u_hy] + (1 - α)[u_q1x1 + u_vq1 + u_vq3 + u_q3q2 + u_hy]
= u_q1x1 + u_hy + (1 - α)[u_vq1 + u_vq3 + u_q3q2]

d(x_3, x_2; G_p) = u_q3x3 + u_q3q2 + u_q2x2

d(x_3, y; G_p) = α[u_q3x3 + u_vq3 + u_vq1 + u_hy] + (1 - α)[u_q3x3 + u_q3q2 + u_hy]
= u_q3x3 + u_hy + α(u_vq3 + u_vq1) + (1 - α)u_q3q2

d(x_2, y; G_p) = α[u_q2x2 + u_q3q2 + u_vq3 + u_vq1 + u_hy] + (1 - α)[u_q2x2 + u_hy]
= u_q2x2 + u_hy + α(u_q3q2 + u_vq3 + u_vq1)

We now solve this system of ten equations.

It is straightforward by simplifying the expressions
that [d(r, x_1; G_p) + d(r, x_3; G_p) - d(x_1, x_3; G_p)]/2 = u_rv, so
a comparison with (a) shows that w_rv(G_p) = u_rv. Similarly
[d(r, x_1; G_p) + d(r, x_2; G_p) - d(x_1, x_2; G_p)]/2 = u_rv, so
the two expressions in (a) for w_rv(G_p) are the same.

Likewise from the ten equations,

[d(r, x_3; G_p) + d(x_1, x_2; G_p) - d(r, x_1; G_p) - d(x_3, x_2; G_p)]/2 = u_vq3
so w_vq3(G_p) = u_vq3;

[d(r, x_3; G_p) + d(x_3, x_2; G_p) - d(r, x_2; G_p)]/2 = u_q3x3
so w_q3x3(G_p) = u_q3x3;

[d(y, x_2; G_p) + d(y, x_1; G_p) - d(x_1, x_2; G_p)]/2 = u_hy
so w_hy(G_p) = u_hy.

From the system of ten equations we see
E_2(G_p) = u_q1x1 + (1 - α)u_vq1 - α u_vq1 = u_q1x1 + (1 - 2α)u_vq1.
Since d(r, x_1; G_p) = u_rv + u_vq1 + u_q1x1 it follows
d(r, x_1; G_p) = u_rv + u_vq1 + E_2(G_p) - (1 - 2α)u_vq1, whence

2α u_vq1 = d(r, x_1; G_p) - E_2(G_p) - u_rv (1)

Similarly

E_4(G_p) = u_q2x2 + u_hy + α(u_q3q2 + u_vq3 + u_vq1) - u_rv - u_hy - α u_vq1
- (1 - α)(u_vq3 + u_q3q2) + u_rv
= u_q2x2 + α(u_q3q2 + u_vq3) - (1 - α)(u_vq3 + u_q3q2)
= u_q2x2 + (2α - 1)(u_vq3 + u_q3q2).

But from d(r, x_2; G_p) = u_rv + u_vq3 + u_q3q2 + u_q2x2 it follows
u_q2x2 = d(r, x_2; G_p) - u_rv - u_vq3 - u_q3q2, so

E_4(G_p) = d(r, x_2; G_p) - u_rv - u_vq3 - u_q3q2 + (2α - 1)(u_vq3 + u_q3q2).

This can be solved to show

(2 - 2α)(u_vq3 + u_q3q2) = d(r, x_2; G_p) - u_rv - E_4(G_p) (2)

Since

d(x_3, y; G_p) = u_q3x3 + u_hy + α(u_vq3 + u_vq1) + (1 - α)u_q3q2

we obtain

α(u_vq3 + u_vq1) + (1 - α)u_q3q2 = d(x_3, y; G_p) - u_q3x3 - u_hy (3)

Note (1), (2), and (3) are equations in the unknowns
α, w_vq1, w_q3q2 in terms of known quantities such as w_rv,
w_q3x3, w_hy, w_vq3, E_2(G_p), E_4(G_p). These three equations in
three unknowns can be solved to yield for G_p (for any p ∈
Par(N) with p(h_0) = q_1) the following:

α(G_p) = [2d(x_3, y; G_p) - 2w_q3x3 - 2w_hy - d(r, x_1; G_p) + E_2(G_p) + 2w_rv
+ E_4(G_p) - d(r, x_2; G_p) + 2w_vq3]/[4w_vq3]

w_vq1(G_p) = [d(r, x_1; G_p) - E_2(G_p) - w_rv]/[2α(G_p)]

w_q3q2(G_p) = [d(x_3, y; G_p) - w_q3x3 - w_hy - α(w_vq3 + w_vq1)]/(1 - α(G_p)).

Moreover, the value of α is independent of the choice
of p.

We thus have C(G_p) = αD(G_p) for each p satisfying
p(h_0) = q_1.

By Lemma 4.6, C(N) = Σ[W(p)C(G_p): p(h_0) = q_1] and
D(N) = Σ[W(p)D(G_p): p(h_0) = q_1].

Hence C(N) = Σ[W(p) αD(G_p): p(h_0) = q_1] = α Σ[W(p)
D(G_p): p(h_0) = q_1] = αD(N).

It follows that α = C(N)/D(N). This proves (i).

Similarly, for any p ∈ Par(N) satisfying p(h_0) = q_1,
since the path from q_1 to x_1 is normal,
d(q_1, x_1; N) = d(q_1, x_1; G_p) = w_q1x1(G_p). By Lemma 4.6
d(q_1, x_1; N) = Σ[W(p)d(q_1, x_1; G_p): p ∈ Par(N), p(h_0) =
q_1] = Σ[W(p)w_q1x1(G_p): p ∈ Par(N), p(h_0) = q_1] = w_q1x1(N),
proving (ii). Similarly d(q_2, x_2; N) = w_q2x2(N), proving
(iii). □
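
The solution of the ten equations can be exercised numerically. The sketch below is an illustration under assumptions: it takes a hypothetical network with the shape of Figure 2 and a single hybrid (so that G_p coincides with N), generates the ten distances from assumed path lengths and an assumed probability, and then applies the expressions (a) to (k) of Lemma 4.9. The dictionary keys and function names are illustrative.

```python
def figure2_distances(u, alpha):
    """The ten tree-average distances, following the equations in the proof."""
    rv, vq1, vq3, q3q2 = u["rv"], u["vq1"], u["vq3"], u["q3q2"]
    q1x1, q2x2, q3x3, hy = u["q1x1"], u["q2x2"], u["q3x3"], u["hy"]
    return {
        ("r", "x1"): rv + vq1 + q1x1,
        ("r", "x3"): rv + vq3 + q3x3,
        ("r", "x2"): rv + vq3 + q3q2 + q2x2,
        ("r", "y"): rv + hy + alpha * vq1 + (1 - alpha) * (vq3 + q3q2),
        ("x1", "x3"): q1x1 + vq1 + vq3 + q3x3,
        ("x1", "x2"): q1x1 + vq1 + vq3 + q3q2 + q2x2,
        ("x1", "y"): q1x1 + hy + (1 - alpha) * (vq1 + vq3 + q3q2),
        ("x3", "x2"): q3x3 + q3q2 + q2x2,
        ("x3", "y"): q3x3 + hy + alpha * (vq3 + vq1) + (1 - alpha) * q3q2,
        ("x2", "y"): q2x2 + hy + alpha * (q3q2 + vq3 + vq1),
    }


def recover(d):
    """Apply (a)-(k) of Lemma 4.9 to leaf distances; requires 0 < alpha < 1."""
    w_rv = (d["r", "x1"] + d["r", "x3"] - d["x1", "x3"]) / 2
    w_vq3 = (d["r", "x3"] + d["x1", "x2"] - d["r", "x1"] - d["x3", "x2"]) / 2
    w_q3x3 = (d["r", "x3"] + d["x3", "x2"] - d["r", "x2"]) / 2
    w_hy = (d["x2", "y"] + d["x1", "y"] - d["x1", "x2"]) / 2
    e2 = d["x1", "y"] - d["r", "y"] + w_rv
    e4 = d["x2", "y"] - d["r", "y"] + w_rv
    alpha = (2 * d["x3", "y"] - 2 * w_q3x3 - 2 * w_hy - d["r", "x1"] + e2
             + 2 * w_rv + e4 - d["r", "x2"] + 2 * w_vq3) / (4 * w_vq3)
    w_vq1 = (d["r", "x1"] - e2 - w_rv) / (2 * alpha)
    w_q3q2 = (d["x3", "y"] - w_q3x3 - w_hy
              - alpha * (w_vq3 + w_vq1)) / (1 - alpha)
    return {
        "alpha": alpha, "rv": w_rv, "vq1": w_vq1, "vq3": w_vq3,
        "q3q2": w_q3q2, "q3x3": w_q3x3, "hy": w_hy,
        "q1x1": d["r", "x1"] - w_rv - w_vq1,
        "q2x2": d["r", "x2"] - w_rv - w_vq3 - w_q3q2,
    }


# Hypothetical path lengths and probability alpha(q1, h0).
TRUE_U = {"rv": 1.0, "vq1": 2.0, "vq3": 1.5, "q3q2": 2.5,
          "q1x1": 4.0, "q2x2": 5.0, "q3x3": 3.0, "hy": 6.0}
TRUE_ALPHA = 0.3
```

Calling `recover(figure2_distances(TRUE_U, TRUE_ALPHA))` returns the assumed probability and path lengths up to rounding. The divisions by α and 1 - α in (h) and (i) mean the sketch is only meaningful when the probability lies strictly between 0 and 1.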

Lemma 4.10. Assume the hypotheses of Theorem 4.1.
Suppose h_0 is hybrid with indegree 2 and parents q_1 and
q_2. Suppose there is a normal path from q_3 to q_2, from
q_2 to x_2 ∈ X, from q_1 to x_1 ∈ X, from h_0 to y ∈ X, and
from q_3 to x_3 ∈ X, but no directed path from q_3 to q_1.

(a) Suppose q_1 has normal child b and there are normal
paths from b to x_1 ∈ X and from b to z_1 ∈ X such
that these paths intersect only at b. Then ω(q_1, b) =
[d(q_1, x_1; N) + d(q_1, z_1; N) - d(x_1, z_1; N)]/2, where d(q_1, x_1;
N) and d(q_1, z_1; N) are determined by Lemma 4.9.

(b) Suppose q_2 has normal child c and there are normal
paths from c to x_2 ∈ X and from c to z_2 ∈ X such
that these paths intersect only at c. Then ω(q_2, c) =
[d(q_2, x_2; N) + d(q_2, z_2; N) - d(x_2, z_2; N)]/2, where d(q_2, x_2;
N) and d(q_2, z_2; N) are determined by Lemma 4.9.

Proof. For (a), Lemma 4.9 applies to yield d(q_1, x_1; N).
By a parallel computation with z_1 replacing x_1, Lemma
4.9 also yields d(q_1, z_1; N). Since the paths from q_1 to x_1
and z_1 are normal, it follows that ω(q_1, b) = d(q_1, b; N)
= [d(q_1, x_1; N) + d(q_1, z_1; N) - d(x_1, z_1; N)]/2 by an
argument like that of Lemma 4.2. A similar argument shows
(b). □

We now turn to the proof of the main theorem 4.1.

Proof. We seek to reconstruct each weight ω(a, b)
and each probability. If b is hybrid, then by assumption
ω(a, b) = 0. Hence we may assume b is normal.

At the tail a we have the following exhaustive list of 
possibilities: 

Case A_1. There is a normal path from a to some w ∈
X such that the path does not go through b. This
includes the possibility where a ∈ X (in which case the
trivial path at a satisfies the condition). Since r ∈ X, this
includes the case a = r.

Case A_2. a is hybrid and b is its unique child. Since a
is hybrid it has two parents q_1 and q_2. Choose a normal
path from q_1 to w_1 ∈ X and from q_2 to w_2 ∈ X.

Case A_3. a has a hybrid child h' with other parent q'.
Choose a normal path from q' to w_1 ∈ X and from h' to
w_2 ∈ X.

At the head b, either b e X or else b is not a leaf and 
b has at least two children, at least one of which must 
be normal. Hence we have the following exhaustive list 
of possibilities: 

Case B_1. b ∈ X.

Case B_2. b has two normal children c_1 and c_2. For i =
1, 2 there is a normal path from c_i to x_i ∈ X.

Case B_3. b has one normal child c and a hybrid child h
for which there is exactly one other parent q. There is a
normal path from c to x ∈ X, from h to y ∈ X, and
from q to z ∈ X.

Since there are three cases for a and three cases for b, we
must consider 9 cases. The case where A_i is combined
with B_j will be denoted Case A_iB_j. We will compute
ω(a, b). To compute the probabilities, it suffices to compute
α(a, h') in situation A_3.

Case A_1B_1. Assume there is a normal path from a to
some w ∈ X such that the path does not go through b,
and b ∈ X. Then Lemma 4.3(2) shows that ω(a, b) =
[d(r, b; N) + d(w, b; N) - d(r, w; N)]/2.

Case A_1B_2. Assume there is a normal path from a to
some w ∈ X such that the path does not go through b.
Assume b has two normal children c_1 and c_2. For i = 1,
2 there is a normal path from c_i to x_i ∈ X. In this case,
Lemma 4.3(3) shows that ω(a, b) = [d(r, x_1; N) + d(w,
x_2; N) - d(r, w; N) - d(x_1, x_2; N)]/2.

Case A_2B_1. Assume a is hybrid and b is its unique
child. Assume b ∈ X. Since a is hybrid it has two parents
q_1 and q_2. Choose a normal path from q_1 to w_1 ∈
X and from q_2 to w_2 ∈ X. In this case, Lemma 4.4
shows that ω(a, b) = [d(b, w_1; N) + d(b, w_2; N) - d(w_1,
w_2; N)]/2.

Case A_2B_2. Assume a is hybrid and b is its unique
child. Since a is hybrid it has two parents q_1 and q_2.
Choose a normal path from q_1 to w_1 ∈ X and from q_2
to w_2 ∈ X. Assume b has two normal children c_1 and c_2.
For i = 1, 2 there is a normal path from c_i to x_i ∈ X. In
this case by Lemma 4.5 we have ω(a, b) = [d(w_1, x_1; N)
+ d(w_2, x_2; N) - d(w_1, w_2; N) - d(x_1, x_2; N)]/2.

Case A_3B_1. Assume a has a hybrid child h' with other
parent q'. Choose a normal path from q' to w_1 ∈ X and
from h' to w_2 ∈ X. Assume b ∈ X. In the equiprobable
case, Lemma 4.7 with q_1 = a, x_1 = b, x_2 = w_1, y = w_2 shows
ω(a, b) = d(a, b; N) = d(b, w_2; N) - d(r, w_2; N) + [d(r, b; N) +
d(r, w_1; N) - d(b, w_1; N)]/2.

In the other case, Lemma 4.9(ii) with q_1 = a and x_1 =
b yields ω(a, b) while Lemma 4.9(i) yields α(a, h').

Case A_3B_2. Assume a has a hybrid child h' with other
parent q'. Choose a normal path from q' to w_1 ∈ X and
from h' to w_2 ∈ X. Assume b has two normal children
c_1 and c_2. For i = 1, 2 there is a normal path from c_i to
x_i ∈ X. In the equiprobable case, Lemma 4.8 with q_1 =
a, y = w_2, z_1 = x_1, z_2 = x_2, h = h', q_2 = q', x_2 = w_1 shows
ω(a, b) = [2d(x_1, w_2; N) - 4d(r, w_2; N) + d(r, x_1; N) + 2d(r,
w_1; N) - d(x_1, w_1; N) + 2d(x_2, w_2; N) + d(r, x_2; N) - d(x_2,
w_1; N) - 2d(x_1, x_2; N)]/4.

In the non-equiprobable case Lemma 4.10(a) applies to
determine ω(a, b), while Lemma 4.9(i) determines α(a,
h').

Case A_1B_3. Assume that there is a normal path from a
to some w ∈ X such that the path does not go through
b. Assume b has one normal child c and a hybrid child
h for which there is exactly one other parent q. There is
a normal path from c to x ∈ X, from h to y ∈ X, and
from q to z ∈ X. See Figure 6. Since N is normal, an
argument like that for Lemma 4.4 shows that Figure 6 is
accurate for the situation.

In this situation, by Lemma 4.3(2), d(a, x; N) = [d(x, r;
N) + d(x, w; N) - d(r, w; N)]/2. In the equiprobable case,
by Lemma 4.7, with b = q_1, x_1 = x, z = x_2, d(b, x; N) =
d(x, y; N) - d(r, y; N) + [d(r, x; N) + d(r, z; N) - d(x, z;
N)]/2.

Finally ω(a, b) = d(a, x; N) - d(b, x; N). In the
non-equiprobable case, Lemma 4.9 with a = q_3 and b = q_2 yields
the computation of ω(a, b) = w_q3q2(N) and Lemma 4.9(i)
shows α(b, h) = α(q_2, h; N) = 1 - α(q_1, h; N).




Case A_2B_3. Assume a is hybrid and b is its unique
child. Since a is hybrid it has two parents q_1 and q_2.
Choose a normal path from q_1 to w_1 ∈ X and from q_2
to w_2 ∈ X. Assume b has one normal child c and a
hybrid child h for which there is exactly one other parent
q. Choose a normal path from c to x ∈ X, from h to
y ∈ X, and from q to z ∈ X.

See Figure 7a. An argument like that for Lemma 4.4
shows that the figure accurately represents what is
needed in the argument. In particular, the normal paths
from q_1 to w_1, from q_2 to w_2, and from q to x have no
vertex in common. Similarly the paths from q to z, from
b to x, and from h to y have no vertex in common.

By Lemma 4.4, d(a, x; N) = [d(x, w_1; N) + d(x, w_2; N) -
d(w_1, w_2; N)]/2. In the equiprobable case, by Lemma 4.7,
d(b, x; N) = d(x, y; N) - d(r, y; N) + [d(r, x; N) + d(r, z;
N) - d(x, z; N)]/2.

In the non-equiprobable case, Lemma 4.9(ii) or 4.9(iii)
similarly yields d(b, x; N). But ω(a, b) = d(a, x; N) - d(b,
x; N) since the path from a to x is normal, so subtracting
these formulas leads to a formula for ω(a, b).

Case A_3B_3. Assume that a has a hybrid child h' with
other parent q. Choose a normal path from q to w_1 ∈
X and from h' to w_2 ∈ X. Assume b has one normal
child c and a hybrid child h for which there is exactly
one other parent q. Choose a normal path from c to x
∈ X, from h to y ∈ X, and from q to z ∈ X.

See Figure 7b. The argument will make two uses of 
Lemma 4.7 or 4.9, and Figure 7b accurately represents 
the situation by arguments like those in Lemma 4.4. 

In the equiprobable case, by Lemma 4.7, d(a, x; N) =
d(x, w_2; N) - d(r, w_2; N) + [d(r, x; N) + d(r, w_1; N) -
d(x, w_1; N)]/2, d(b, x; N) = d(x, y; N) - d(r, y; N) +
[d(r, x; N) + d(r, z; N) - d(x, z; N)]/2.

But then ω(a, b) = d(a, x; N) - d(b, x; N) since the
path from a to x is normal. In the other case, Lemma
4.9(ii) or 4.9(iii) yields d(a, x; N) and d(b, x; N) and
again ω(a, b) is determined. Moreover, Lemma 4.9(i)
yields α(a, h') and α(q, h).

Since all 9 cases yield a formula for ω(a, b) and also
any relevant probability when a is parent to a hybrid
and b is a normal child of a, the proof of the theorem is
complete.

Figure 7 Case A_2B_3 (a), Case A_3B_3 (b).

Corollary 4.11. Suppose N = (V, A, r, X) is a normal
phylogenetic X-network such that each hybrid vertex has
indegree 2 and, if it is not a leaf, outdegree 1. Let n =
|X| and a be the total number of arcs directed into any
normal vertex. Then a ≤ (n choose 2).

Proof. We may assume that the arcs have weights and
that each hybrid is equiprobable. Each of the weights
ω(u, v), where (u, v) is an arc directed into a normal vertex v,
is uniquely determined from the (n choose 2) linear equations
obtained from the (n choose 2) distances given by the
tree-average distance function. Hence there are at most
(n choose 2) variables. □

Figure 1 gives an example in which n = 4 and there
are exactly (4 choose 2) = 6 arcs directed into a normal vertex.
Hence the bound in Corollary 4.11 is tight.

5 An example 

We illustrate the calculations of Section 4 to find the
values of the weight function given the network and
the tree-average distance. Figure 4 exhibits a phylogenetic
X-network N = (V, A, r, X) with X = {1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11} and root 1 which satisfies the
hypotheses of Theorem 4.1. Observe that by Corollary
3.4, N displays exactly 4 trees, and there are exactly
four parent maps. Let ω be a weight function on A
such that ω(a, b) = 0 when b is hybrid but ω(a, b) > 0
when b is normal. Let d(x, y) = d(x, y; N) denote the
resulting tree-average distance between x and y in X.
Suppose first that we assume equiprobability about the
network, so each α(a, b) = 1/2 when b is hybrid. There
are 24 arcs for which we compute the weights as
follows:

First, since 16 and 20 are hybrid, we have ω(17, 16) =
ω(15, 16) = ω(21, 20) = ω(23, 20) = 0.

By Lemma 4.3(2), ω(19, 8) = [d(8, 1) + d(7, 8) - d(1, 7)]/2,
ω(19, 7) = [d(7, 1) + d(7, 8) - d(1, 8)]/2, and we similarly
find ω(14, 3), ω(13, 2), and ω(22, 10).

By Lemma 4.3(1), ω(1, 22) = [d(1, 9) + d(1, 11) - d(9,
11)]/2. By Lemma 4.3(3), ω(18, 19) = [d(1, 8) + d(6, 7) -
d(1, 6) - d(7, 8)]/2, ω(12, 13) = [d(1, 2) + d(11, 3) - d(1,
11) - d(2, 3)]/2, and we similarly find ω(13, 14) and
ω(22, 12).

By Lemma 4.5, ω(20, 18) = [d(9, 8) + d(11, 6) - d(9,
11) - d(8, 6)]/2. By Lemma 4.4, ω(16, 5) = [d(5, 4) +
d(5, 6) - d(4, 6)]/2.

By Lemma 4.7 in the equiprobable case, ω(21, 9) =
d(9, 7) - d(1, 7) + [d(1, 9) + d(1, 11) - d(9, 11)]/2,
ω(23, 11) = d(11, 7) - d(1, 7) + [d(1, 9) + d(1, 11) -
d(9, 11)]/2, and we similarly find ω(17, 6) and ω(15, 4).

By Lemma 4.3(2), d(18, 6) = [d(6, 1) + d(6, 7) - d(1,
7)]/2. But then ω(18, 17) = d(18, 6) - ω(17, 6).

Similarly by Lemma 4.2(2) d(14, 4) = [d(4, 1) + d(4, 3)
- d(1, 3)]/2 and then ω(14, 15) = d(14, 4) - ω(15, 4).

Similarly by Lemma 4.3(2) d(12, 11) = [d(11, 1) + d(11,
2) - d(1, 2)]/2 and then ω(12, 23) = d(12, 11) - ω(23, 11).

Finally, d(10, 9) is known since 10 ∈ X, so ω(10, 21) =
d(10, 9) - ω(21, 9). This concludes the calculation of all
the weights for N in the equiprobable case. Note that in
several of these calculations, there were alternative
choices possible. For example, we also have ω(22, 12) =
[d(1, 4) + d(9, 11) - d(1, 9) - d(4, 11)]/2.
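The equiprobable calculation above can be replayed numerically. In the sketch below, the arc list is an assumption: it is reconstructed from the 24 weights named in the text rather than read from Figure 4, and the positive weights are random placeholders. The code averages path lengths over the four displayed trees and checks that the explicit formulas of this section return the weights that were put in.

```python
from itertools import product
import random

# Arcs inferred from the weights named in Section 5 (an assumption).
ARCS = [(17, 16), (15, 16), (21, 20), (23, 20), (19, 8), (19, 7), (14, 3),
        (13, 2), (22, 10), (1, 22), (18, 19), (12, 13), (13, 14), (22, 12),
        (20, 18), (16, 5), (21, 9), (23, 11), (17, 6), (15, 4), (18, 17),
        (14, 15), (12, 23), (10, 21)]
HYBRID_PARENTS = {16: (17, 15), 20: (21, 23)}

# Hypothetical weights: 0 into hybrids, random positive values elsewhere.
rng = random.Random(0)
OMEGA = {(u, v): 0.0 if v in HYBRID_PARENTS else rng.uniform(0.5, 2.0)
         for (u, v) in ARCS}

def displayed_trees():
    """One child -> parent map per choice of parent at each hybrid."""
    normal = {v: u for (u, v) in ARCS if v not in HYBRID_PARENTS}
    hybrids = list(HYBRID_PARENTS)
    for choice in product(*(HYBRID_PARENTS[h] for h in hybrids)):
        yield {**normal, **dict(zip(hybrids, choice))}

TREES = list(displayed_trees())

def tree_distance(parent, u, v):
    """Sum of weights on the path joining u and v in one displayed tree."""
    def dist_to_ancestors(s):
        total, out = 0.0, {s: 0.0}
        while s in parent:
            total += OMEGA[parent[s], s]
            s = parent[s]
            out[s] = total
        return out
    du, dv = dist_to_ancestors(u), dist_to_ancestors(v)
    return min(du[s] + dv[s] for s in du if s in dv)

def d(u, v):
    """Equiprobable tree-average distance between members of X."""
    return sum(tree_distance(p, u, v) for p in TREES) / len(TREES)

recovered = {
    (19, 8): (d(8, 1) + d(7, 8) - d(1, 7)) / 2,
    (19, 7): (d(7, 1) + d(7, 8) - d(1, 8)) / 2,
    (1, 22): (d(1, 9) + d(1, 11) - d(9, 11)) / 2,
    (18, 19): (d(1, 8) + d(6, 7) - d(1, 6) - d(7, 8)) / 2,
    (12, 13): (d(1, 2) + d(11, 3) - d(1, 11) - d(2, 3)) / 2,
    (20, 18): (d(9, 8) + d(11, 6) - d(9, 11) - d(8, 6)) / 2,
    (16, 5): (d(5, 4) + d(5, 6) - d(4, 6)) / 2,
    (21, 9): d(9, 7) - d(1, 7) + (d(1, 9) + d(1, 11) - d(9, 11)) / 2,
    (23, 11): d(11, 7) - d(1, 7) + (d(1, 9) + d(1, 11) - d(9, 11)) / 2,
    (22, 12): (d(1, 4) + d(9, 11) - d(1, 9) - d(4, 11)) / 2,
}
recovered[10, 21] = d(10, 9) - recovered[21, 9]

max_error = max(abs(recovered[arc] - OMEGA[arc]) for arc in recovered)
assert max_error < 1e-9
```

Every call to `d` uses only members of X = {1, ..., 11}, so the recovery relies on nothing beyond the tree-average distances and the network itself.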

The general case where we do not assume equiprobability
proceeds in a similar manner, different from the above
only in the use of Lemma 4.9 in place of Lemma 4.7. We
compute ω(21, 9), ω(23, 11), α(21, 20), and α(23, 20)
using Lemma 4.9 with x_1 = 9, x_2 = 11, x_3 = 2, and y = 7.
We compute ω(17, 6), ω(15, 4), α(17, 16), and α(15, 16)
using Lemma 4.9 with x_1 = 6, x_2 = 4, x_3 = 3, y = 5.
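In the general case each displayed tree is weighted by its probability. Because inheritance events at different hybrids are assumed independent, that probability is the product of the α values of the chosen parents. The sketch below shows this for the two hybrids 16 and 20 of the example. The α values are hypothetical, and the parent pairs are read off from the weights named above.

```python
from itertools import product
from math import prod

# Hypothetical inheritance probabilities; the two values at each hybrid sum to 1.
ALPHA = {
    (21, 20): 0.7, (23, 20): 0.3,
    (17, 16): 0.4, (15, 16): 0.6,
}
HYBRID_PARENTS = {20: (21, 23), 16: (17, 15)}

def parent_map_probabilities():
    """Map each choice of parents (one per hybrid) to the probability of its tree."""
    hybrids = list(HYBRID_PARENTS)
    probs = {}
    for choice in product(*(HYBRID_PARENTS[h] for h in hybrids)):
        probs[choice] = prod(ALPHA[p, h] for p, h in zip(choice, hybrids))
    return probs

PROBS = parent_map_probabilities()

# Four parent maps, and their probabilities form a distribution.
assert len(PROBS) == 4
assert abs(sum(PROBS.values()) - 1.0) < 1e-12
```

With every α equal to 1/2, each of the four parent maps receives probability 1/4, which is the plain average of the equiprobable case.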

6 Extensions 

Theorem 4.1 applies only to normal phylogenetic net- 
works for which the indegree of each hybrid vertex is 2. 
It would be interesting to see whether the same results
are true without the restriction on the indegree of a 
hybrid vertex. Whereas I have verified this for several 
individual networks with vertices of indegree 3 or 4, I 
do not have a general proof. 

In the event of a true hybridization between two sex- 
ual species, it is plausible to assume that the indegree is 
2 and that each parent contributes approximately 
equally. Hence in this case it is plausible that we would 
obtain the tree-average distance utilized in Theorem 4.1. 
Nevertheless, backcrossing of the hybrid h with one of
the parental species q_1 could easily increase the fraction
of the genome of q_1 in h, changing it from 50%. Similarly,
if the reticulation is actually a horizontal gene
transfer, common between bacteria, then there is no 
guarantee that the sources contribute approximately 
equally. Hence the occurrence of probabilities different 
from 1/2 seems likely. 